I. Real Party in Interest 

The real party in interest in the present appeal is Intel Corporation of Santa Clara, 
California, the assignee of the present application. 

II. Related Appeals and Interferences 

There are no related appeals or interferences to appellant's knowledge that would 
have a bearing on any decision of the Board of Patent Appeals and Interferences. 

III. Status of the Claims (independent claims shown in bold) 
Claims 1-7, 8-15, 19-20, 25, 32 and 38 are canceled. 

Claims 17 and 26-38 stand rejected under 35 USC § 112, second paragraph, as 
allegedly being indefinite. 

Claims 16-18, 26-31, 33-37 and 39-42 stand rejected under 35 USC § 103(a) as 
allegedly being unpatentable over US Patent 5,859,789 (Sidwell) in view of 
Visual Instruction Set (VIS™) User's Guide, Sun Microsystems, March 1997 (Sun). 

Claims 21-22, 23-24, 33-34 and 43-44 stand rejected under 35 USC § 103(a) as 
allegedly being unpatentable over US Patent 5,859,789 (Sidwell) in view of 
Visual Instruction Set (VIS™) User's Guide, Sun Microsystems, March 1997 (Sun) and 
further in view of US Patent 5,721,697 (Lee). 

Non-final rejection of claims 16-18, 21-22, 23-24, 26-31, 33-37 and 39-44 is 
being appealed. 

42390P5943C -3- 
IV. Status of Amendments 

A preliminary amendment, submitted by appellant on 11/6/2001, was entered. An 
official response to an Office Action mailed 8/19/2003 was submitted by appellant on 
1/20/2004 and was entered. A Final Office Action was mailed on 4/9/2004. Appellant 
responded with an amendment and official response after final on 6/9/2004, which was 
entered and an Advisory Action was mailed 7/9/2004. An RCE and official response, 
which was not accepted, were submitted by appellant on 10/11/2004. A Notice of Non- 
Compliant Amendment was mailed 10/25/2004. Appellant responded by submitting a 
corrected official response on 11/5/2004, which was not accepted. A second Notice of 
Non-Compliant Amendment was mailed 12/9/2004. Appellant submitted a second 
corrected official response on 12/20/2004, which was entered. A Non-final Office Action 
was mailed on 1/10/2005. A Notice of Appeal was transmitted on 6/10/2005, and an 
appeal ensued. Another amendment is being submitted, under 37 CFR § 41.33 and 
concurrent with the present appeal brief. 

Accordingly, the claims stand as of the concurrently submitted amendment of 
8/10/2005, and are reproduced in clean form in the Claims Appendix. 
V. Summary of Claimed Subject Matter 

Appellant's disclosure describes methods and apparatus for computing multiple 
absolute differences from packed data and summing the multiple absolute differences 
together to produce a result using an execution unit that also performs multiple multiply- 
add operations. According to one embodiment, a processor includes a decode unit to 
decode a packed sum of absolute differences (PSAD) instruction having an opcode format 
to identify a set of packed data. The decoder initiates a first set of operations responsive to 
decoding the PSAD instruction. An execution unit performs a first operation of the first set 
of operations initiated by the decode unit. According to another embodiment, the 
processor also executes instructions of the PENTIUM® microprocessor instruction set. 

According to another embodiment, the first set of operations comprises a packed 
subtract and write carry (PSUBWC) operation; a packed absolute value and read carry 
(PABSRC) operation; and a packed add horizontal (PADDH) operation. 
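
For illustration only, the three operations named above can be modeled in software on packed unsigned bytes. This is a minimal sketch, not the claimed hardware: the function names, the list-of-bytes representation, and the carry-bit polarity (1 meaning the difference is negative) are assumptions made for the example.

```python
# Illustrative software model of PSUBWC, PABSRC and PADDH on packed bytes.
# Assumptions (not from the specification): elements are Python ints in 0..255,
# and a carry bit of 1 marks a negative difference.

def psubwc(d, e):
    """Packed subtract and write carry: F_i = D_i - E_i (mod 256), plus one carry bit per element."""
    f = [(x - y) & 0xFF for x, y in zip(d, e)]
    c = [1 if x < y else 0 for x, y in zip(d, e)]
    return f, c

def pabsrc(f, c):
    """Packed absolute value and read carry: G_i = 0 + F_i if non-negative, else 0 - F_i."""
    return [((0 - x) & 0xFF) if carry else x for x, carry in zip(f, c)]

def paddh(g):
    """Packed add horizontal: sum of all packed elements."""
    return sum(g)

def psad(d, e):
    """Sum of absolute differences built from the three operations above."""
    f, c = psubwc(d, e)
    return paddh(pabsrc(f, c))
```

For eight byte pairs, `psad(d, e)` returns the same value as summing `abs(x - y)` over the pairs, which is the result the PSAD instruction is described as producing.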

According to another embodiment, performing the first operation causes the 
execution unit to produce a first plurality of partial products in a multiplier having a 
plurality of partial product selectors, to insert elements of a first packed data into and 
substituting for bit positions of one or more of the first plurality of partial products by 
using partial product selectors corresponding to the bit positions, and to add the first 
plurality of elements together to produce a sum of the first plurality of elements. 

According to another embodiment, the decode unit also decodes a packed multiply- 
add (PMAD) instruction having a second format to identify a second set of packed data and 
initiates a second set of operations responsive to decoding the PMAD instruction. The 
execution unit also performs a second operation of the second set of operations initiated by 
the decode unit. 

According to another embodiment, performing the second operation causes the 
execution unit to produce a second plurality of partial products in the multiplier having 
said plurality of partial product selectors, the second plurality of partial products 
comprising four distinct sets of partial products including a first, a second, a third and a 
fourth set of partial products corresponding to a first, a second, a third and a fourth product 
for elements of the second set of packed data, and to add the first and second sets of partial 
products together to produce a first distinct element of a packed result and to add the third 
and fourth sets of partial products together to produce a second distinct element of the 
packed result. 
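
As a rough illustration of the pairwise summation described above, the packed multiply-add result can be modeled as four element-wise products summed by pairs. This is a sketch under stated assumptions: the function name `pmad` and the use of unbounded Python integers (no overflow or signedness handling) are choices made for the example, not details taken from the specification.

```python
# Illustrative model of a packed multiply-add on four-element packed words.
# Assumption: plain Python ints stand in for 16-bit elements; overflow is ignored.

def pmad(a, b):
    """Return two elements: the sums of the first and second pairs of products."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected four packed word elements per operand")
    products = [x * y for x, y in zip(a, b)]  # four distinct products
    return [products[0] + products[1], products[2] + products[3]]
```

For example, `pmad([1, 2, 3, 4], [5, 6, 7, 8])` yields `[17, 53]`, that is, 1·5 + 2·6 and 3·7 + 4·8.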

Claim 26 sets forth a processor to execute instructions of the PENTIUM 
microprocessor instruction set 1, the processor comprising: decode logic 2 to decode a 
packed sum of absolute differences (PSAD) instruction 3 having a first format to identify a 
first set of packed data 4, said decode logic to initiate a first set of operations on the first set 
of packed data responsive to decoding the PSAD instruction 2; execution logic to perform a 



1 "In one embodiment of the invention, the processor 105 supports the Pentium® microprocessor 
instruction set and the packed data instruction set 145. By including the packed data instruction set 145 into 
a standard microprocessor instruction set, such as the Pentium® microprocessor instruction set, packed data 
instructions can be easily incorporated into existing software (previously written for the standard 
microprocessor instruction set)." (Fig. 1, p. 9, line 20 through p. 10, line 1). Pentium® is a registered 
trademark of Intel Corporation. (p. 10, lines 6-7). 

2 "The decode unit 140 is used for decoding instructions received by the processor 105 into control signals 
and/or microcode entry points. In response to these control signals and/or microcode entry points, the 
execution unit 142 performs the appropriate operations. The decode unit 140 may be implemented using 
any number of different mechanisms (e.g., a look-up table, a hardware implementation, a PLA, etc.)." 
(Fig. 1, p. 9, lines 8-13) 

3 "The decode unit 140 is shown including a packed data instruction set 145 for performing operations on 
packed data. In one embodiment, the packed data instruction set 145 includes a PMAD instruction(s) 150, 
a PADD instruction(s) 151, a packed subtract instruction(s) (PSUB) 152, a packed subtract with saturate 
instruction(s) (PSUBS) 153, a packed maximum instruction(s) (PMAX) 154, a packed minimum 
instruction(s) (PMIN) 155 and a packed sum of absolute differences instruction(s) (PSAD) 160." (Fig. 1, p. 
9, lines 14-20) 

4 "In one embodiment of the invention, the execution unit 142 operates on data in several different packed 
(non-scalar) data formats. For example, in one embodiment, the exemplary computer system 100 
manipulates 64-bit data groups and the packed data can be in one of three formats: a "packed byte" format, 
a "packed word" format, or a "packed double-word" (dword) format. Packed data in a packed byte format 
includes eight separate 8-bit data elements. Packed data in a packed word format includes four separate 
16-bit data elements and packed data in a packed dword format includes two separate 32-bit data elements." 
(p. 10, lines 11-18) 

first operation of the first set of operations initiated by the decode logic; and a bus to 
provide the first set of packed data to the execution logic for performing of the first 
operation. 

Claim 39 sets forth a processor comprising: decode logic 2 to decode a packed sum 
of absolute differences (PSAD) instruction 3 having a first format to identify a first set of 
packed data 4, said decode logic to initiate a first set of operations on the first set of packed 
data responsive to decoding the PSAD instruction 2, the first set of operations comprising: a 
packed subtract and write carry (PSUBWC) operation 6; a packed absolute value and read 
carry (PABSRC) operation 7; and a packed add horizontal (PADDH) operation 8; and 
execution logic to perform the first set of operations initiated by the decode logic. 

Claim 16 sets forth a processor comprising: a decode unit 2 to decode a plurality of 
packed data instructions including a packed sum of absolute differences (PSAD) 
instruction 3 having a first format to identify a first set of packed data 4, and a packed 
multiply-add (PMAD) instruction 3, 9 having a second format to identify a second set of 



5 "FIG. 1 illustrates that the processor 105 includes a decode unit 140, a set of registers 141, an execution 
unit 142, and an internal bus 143 for executing instructions. Of course, the processor 105 contains 
additional circuitry, which is not necessary to understanding the invention. The decode unit 140, the set of 
registers 141 and the execution unit 142 are coupled together by the internal bus 143." (Fig. 1, p. 9, lines 4- 

6 "In step 500, the first operation is a packed subtract and write carry (PSUBWC) operation. For example, 
in a PSUBWC F <- D, E operation, each packed data element Fi of the packed byte data F is computed by 
subtracting the packed data element Ei of the packed byte data E from the corresponding packed data 
element Di of the packed byte data D. Each packed data element in the packed byte data D, E, and F 
represent an unsigned integer. Each carry bit Ci of a set of carry bits C is stored. Each carry bit Ci 
indicates the sign of the corresponding packed data element Fi." (Fig. 5, p. 13, lines 5-11) 

7 "In step 510, the second operation is a packed absolute value and read carry (PABSRC) operation. For 
example, in a PABSRC G <- 0, F operation, each packed data element Gi of a packed byte data G is 
computed by adding a packed data element Fi of the packed byte data F to a zero 501 (if the carry bit Ci 
indicates the corresponding packed data element Fi is non-negative) and subtracting the packed data 
element Fi from the zero 501 (if the carry bit Ci indicates the corresponding packed data element Fi is 
negative)." (Fig. 5, p. 13, lines 12-18) 

8 "In step 520, the third operation is a packed add horizontal (PADDH) operation. For example, in a 
PADDH R <- G, 0 operation, a PMAD circuit is used to produce the result RS having a field that 
represents the sum of all of the packed data elements of packed byte data G as described with reference to 
FIGS. 11, 12 and 13 below. The PADDH operation is also referred to as a horizontal addition operation." 
(Fig. 5, p. 13, line 21 through p. 14, line 2) 

9 "FIG. 2 illustrates one embodiment of the PMAD instruction 150. Each packed data element Ai of a 
packed word data A is multiplied by the corresponding packed data element Bi of a packed word data B to 

packed data, said decode unit to initiate a first set of operations on the first set of packed 
data responsive to decoding the PSAD instruction and to initiate a second set of operations 
on the second set of packed data responsive to decoding the PMAD instruction 2; and an 
execution unit 2, 10 to perform a first operation of the first set of operations initiated by the 
decode unit and to perform a second operation of the second set of operations initiated by 
the decode unit 2. 

Claim 17 sets forth the processor of Claim 16, wherein the decode unit further 
decodes a plurality of instructions of a PENTIUM microprocessor instruction set 1. 

Claim 21 sets forth the processor of Claim 16, wherein performing the first 
operation causes the execution unit to: produce a first plurality of partial products in a 
multiplier 11 having a plurality of partial product selectors 12; insert an element of a first 
plurality of elements of a first packed data into and substituting for bit positions of one or 
more of the first plurality of partial products 13 by using partial product selectors 
corresponding to the bit positions 14; and add the first plurality of elements together to 



produce doubleword products that are summed by pairs to generate the two packed data elements T0 and T1 
of a packed dword data T. Thus, T0 is A1B1 + A2B2 and T1 is A3B3 + A4B4. As illustrated, the packed data 
elements of packed dword data T are twice as wide as the packed data elements of the packed word data A 
and the packed word data B." (Fig. 2, p. 11, lines 9-15) 

10 "FIG. 11 illustrates one embodiment of a PADDH apparatus of the present invention. . . . When the 
CNTR2 signal is deasserted, a PADDH apparatus 1150 performs the PMAD instruction 150." (see Figs. 11 
& 12, p. 20, line 9 through p. 26, line 22) 

11 "The portions of the eight selected partial products of the first sixteen partial products and all the bit 
positions of the remaining partial products on the bus 1101 and the bus 1102 are generated (using prior art 
partial product selectors or PADDH partial product selectors, for example) as described in the case of the 
CNTR2 signal being deasserted." (Fig. 11, p. 22, lines 20-24) 

12 "In one embodiment, the set of 16x16 multipliers 1100 use multiple partial product selectors employing 
Booth encoding to generate partial products. Each partial product selector receives a portion of the 
multiplicand and a portion of the multiplier and generates a portion of a partial product according to 
well-known methods." (Fig. 11, p. 21, lines 9-12) "FIG. 13 illustrates one embodiment of a PADDH partial 
product selector of the present invention." (see Fig. 13, p. 26, line 22 through p. 28, line 2) 

13 "When the CNTR2 signal is asserted, certain partial product selectors (PADDH partial product selectors) 
within the set of 16x16 multipliers 1100 are configured to insert each packed data element Gi into a portion 
of one of the first sixteen partial products." (Fig. 11, p. 22, lines 9-11) 

14 "The PADDH partial product selectors are configured to insert the packed data element G0 at A10-A17, 
the packed data element G1 at B08-B15, the packed data element G2 at C06-C13, the packed data element 
G3 at D04-D11, the packed data element G4 at I10-I17, the packed data element G5 at J08-J15, the packed 
data element G6 at K06-K13, and the packed data element G7 at L04-L11." (Fig. 12, p. 24, lines 21-25) 

produce a first result including a field comprising a sum of the first plurality of elements, 
said field having a least significant bit 15. 

Claim 23 sets forth a processor comprising: a decode unit 2 to decode a plurality of 
packed data instructions including a packed sum of absolute differences (PSAD) 
instruction 3 having a first format to identify a first set of packed data 4, and a packed 
multiply-add (PMAD) instruction 3, 9 having a second format to identify a second set of 
packed data 4, said decode unit to initiate a first set of operations on the first set of packed 
data responsive to decoding the PSAD instruction and to initiate a second set of operations 
on the second set of packed data responsive to decoding the PMAD instruction 2; and an 
execution unit 2, 10 to perform a first operation of the first set of operations initiated by the 
decode unit and to perform a second operation of the second set of operations initiated by 
the decode unit 2; wherein performing the first operation causes the execution unit to: 
produce a first plurality of partial products in a multiplier 11 having a plurality of partial 
product selectors 12, insert an element of a first plurality of elements of a first packed data 
into and substituting for bit positions of one or more of the first plurality of partial 
products 13 by using partial product selectors corresponding to the bit positions 14, and add 
the first plurality of elements together to produce a first result including a field comprising 
a sum of the first plurality of elements, said field having a least significant bit 15; and 
wherein performing the second operation causes the execution unit to: produce a second 
plurality of partial products in the multiplier 16 having the plurality of partial product 



15 "The CSA tree with CLA 1110 is coupled to receive the first sixteen partial products on the bus 1101 
and generate the sum of the first sixteen partial products on the bus 1103. The sum of the first sixteen 
partial products on the bus 1103 includes the sum all of the packed data elements of the packed data G in a 
field within the result (see FIG. 12). . . . The shifter 1130 performs a right shift operation on the result RS to 
produce the result R having the field representing the sum all of the packed data elements of packed byte 
data G aligned with the least significant bit of the result R. In one embodiment, a right shift of RS by 10 
bits is used to generate the result R." (see Fig. 11, p. 23, lines 5-19) 

16 "The set of 16x16 multipliers 1100 multiply each packed data element Ai of the packed word data A 
received on the bus 1140 with the corresponding packed data element Bi of the packed word data B 
received on the bus 1141 to produce thirty-two 18-bit partial products using radix 4 multiplication." 
(Fig. 11, p. 20, line 13 through p. 21, line 3) 

selectors 12, the second plurality of partial products comprising four distinct sets of partial 
products including a first set of partial products corresponding to a first product for 
elements of the second set of packed data, a second set of partial products corresponding to 
a second product for elements of the second set of packed data, a third set of partial 
products corresponding to a third product for elements of the second set of packed data, 
and a fourth set of partial products corresponding to a fourth product for elements of the 
second set of packed data 17, and add the first set of partial products together with the 
second set of partial products to produce a first distinct element of a packed result and add 
the third set of partial products together with the fourth set of partial products to produce a 
second distinct element 18 of the packed result. 



17 "The eight partial products corresponding to the product of A0 and B0 and the eight partial products 
corresponding to the product of A1 and B1 (the first sixteen partial products) are produced on a bus 1101. 
The eight partial products corresponding to the product of A2 and B2 and the eight partial products 
corresponding to the product of A3 and B3 (the second sixteen partial products) are produced on a bus 
1102." (Fig. 11, p. 21, lines 3-8) 

18 "A carry-save adder (CSA) tree with carry lookahead adder (CLA) 1110 is coupled to receive the first 
sixteen partial products on the bus 1101 and generate the sum of the first sixteen partial products on a bus 
1103. The sum of the first sixteen partial products on the bus 1103 is the sum of the product of A0 and B0 
and the product of A1 and B1. The CSA tree with CLA 1120 is coupled to receive the second sixteen partial 
products on the bus 1102 and generate the sum of the second sixteen partial products on a bus 1104. The 
sum of the second sixteen partial products on the bus 1103 is the sum of the product of A2 and B2 and the 
product of A3 and B3." (Fig. 11, p. 21, lines 13-20) 

VI. Grounds of Rejection to be Reviewed on Appeal 

A. Claims 17 and 26-38 stand rejected under 35 USC § 112, as allegedly being 
indefinite. 

B. Claims 16-20, 25, 26-38 and 39-42 stand rejected under 35 USC § 103(a) as 
allegedly being unpatentable over US Patent 5,859,789 (Sidwell) in view of 
Visual Instruction Set (VIS™) User's Guide, Sun Microsystems, March 
1997 (Sun). 

C. Claims 21-24, 33-34 and 43-44 stand rejected under 35 USC § 103(a) as 
allegedly being unpatentable over US Patent 5,859,789 (Sidwell) in view of 
Visual Instruction Set (VIS™) User's Guide, Sun Microsystems, March 
1997 (Sun) and further in view of US Patent 5,721,697 (Lee). 

VII. Argument 

A. 35 U.S.C. § 112 REJECTIONS 

Claims 17 and 26-38 stand rejected under 35 USC § 112, second paragraph, as 
allegedly being indefinite, the Office Action (8.2) stating that through use of the 
trademark, PENTIUM®, the claim fails to identify any particular material or product, and 
as a result, renders the claim indefinite. 

1. Claims 17 and 26-38 Are Not Indefinite. 

With regard to Claims 17 and 26, the Office Action mailed January 10, 2005, 
states that appellant is "overlooking the clear language of MPEP 2173.05(u)." Appellant 
respectfully points out that MPEP 2173.05(u) states that (emphasis supplied): 

"The presence of a trademark or trade name in a claim is not, per se, improper under 35 USC § 
112, second paragraph, but the claim should be carefully analyzed to determine how the mark or 
name is used in the claim. It is important to recognize that a trademark or trade name is used to 
identify a source of goods, and not the goods themselves. . . . If the trademark or trade name is used 
in a claim to identify or describe a particular material or product, the claim does not comply with 
the requirements of 35 U.S.C. 112, second paragraph. ... Does its presence in the claim cause 
confusion as to the scope of the claim?" 

The Office Action (8.3) incorrectly characterizes a declaration submitted October 
11, 2004, suggesting appellant attested that the trademark, PENTIUM, specifically 
identifies a particular product. Appellant respectfully rebuts the suggestion. Appellant 
states instead that the phrase, "instructions of the PENTIUM microprocessor instruction 
set," had, and has, a fixed and definite meaning, and would apprise one skilled in the art 
of claims 17 and 26's respective scope. 

Appellant submits that, as the Office Action (8.2) correctly states, the trademark, 
PENTIUM, is not being used as descriptive of a particular material or product, which 
would not be in compliance with 35 USC § 112, second paragraph. Rather, what is set 
forth is, "instructions of a PENTIUM microprocessor instruction set," which is 
descriptive of the source of a publicly disclosed microprocessor instruction set. Appellant 
submits that the microprocessor instruction set associated with that particular source is a 
well established de facto standard of compatibility. 

As evidenced by arguments found in the Office Action (8.2), the Examiner 
apparently understands from MPEP 2173.05(u) that even if the trademark, PENTIUM, is 
descriptive of the source of a microprocessor instruction set, and is not being used as 
descriptive of a particular material or product, the use of the trademark, PENTIUM, in 
claims 17 and 26 would still, necessarily, be improper. Appellant respectfully disagrees. 

Appellant submits that MPEP 2173.05(u) cites a decision of the Board of Patent 
Appeals and Interferences in Ex parte Simpson, 218 U.S.P.Q. 1020 (Bd. App. 1982), 
where the term "Hypalon" was being used (improperly) as a noun in a claim to describe 
the physical and/or other properties of a material. The Board ruled that, "[t]he claim 
scope [was] uncertain as regards the material which forms the 'Hypalon.'" 

Such is not the case with the present claims 17 and 26 where the trademark, 
"PENTIUM," is being used as an adjective descriptive of the source of a well known, 
publicly disclosed, microprocessor instruction set. 

Appellant respectfully argues that the presence of a trademark or trade name in a 
claim is not improper under 35 USC § 112, second paragraph, if its presence in the claim 
does not cause confusion as to the scope of the claim. 

Claim 17, for example, sets forth: 

17. (Previously Presented) The processor of Claim 16, wherein the decode unit further 
decodes a plurality of instructions of a PENTIUM microprocessor instruction 
set. 

Claim 26, for example, also sets forth: 

26. (Previously Presented) A processor to execute instructions of the PENTIUM 
microprocessor instruction set, the processor comprising: 

decode logic to decode a packed sum of absolute differences (PSAD) 
instruction having a first format to identify a first set of packed data, said decode 
logic to initiate a first set of operations on the first set of packed data responsive 
to decoding the PSAD instruction; 

execution logic to perform a first operation of the first set of operations 
initiated by the decode logic; and 

a bus to provide the first set of packed data to the execution logic for 
performing of the first operation. 

Claims 17 and 26 set forth, respectively, a decode unit to decode and a processor 
to execute instructions of the PENTIUM microprocessor instruction set. Appellant 
respectfully submits that the PENTIUM microprocessor instruction set identifies the 
source of the instruction set in such a way that the scope of the subject matter embraced 
by the claim is fixed and definite. Further, the PENTIUM microprocessor instruction set 
was known generally to those skilled in the art at the time the original application was 
filed. For example, numerous software engineers and hardware engineers relied and still 
rely upon the publicly available definition of the PENTIUM microprocessor instruction 
set in order to conduct business and to plan engineering projects. 

Appellant respectfully submits as further evidence of the above conclusion the 
accompanying exhibits (emphasis supplied): 

i. The "Pentium® Processor Family Developer's Manual, Vol. 3: Architecture and 
Programming Manual," 1995, pp. 25-165 and 25-166, cited by the Examiner in the Office Action 
(8.5) as extrinsic evidence for a common knowledge of the PENTIUM microprocessor instruction 
set by one of ordinary skill in the art. 

ii. A November 1997 article by Eric Traut from BYTE, which discusses a Macintosh 
application that employs a Pentium instruction-set emulator, complete with MMX™ 
instructions. 

iii. A definition of AMD from wordIQ.com, which explains (in the History section, 
paragraph 7) that at some time about one year after AMD purchased NexGen in 1996, "the K6 
[processor] translated the Pentium compatible x86 instruction set to RISC-like micro-instructions." 

iv. John Savill's FAQ (Frequently Asked Questions) for Windows web page, dated 
September 3, 1999, which asks, "Do I really need 166Mhz Pentium processors to run SQL Server 
7.0?" The answer given states, "No. But you DO need a 100% PENTIUM compatible chip - 
which rules out some Cyrix and IBM processors." The page further explains (in paragraph 3) that 
"speed of the processor doesn't matter as long as it runs the full Pentium instruction set." 

v. A current product description of a single-board computer from SBS Technologies, which 
includes a "Pentium compatible Geode GX1 processor." 

vi. A Department of Energy (hq.doe.gov) description of the Hardware & System 
Requirements for Microsoft Windows 2000 and Microsoft Office 2000 by IT Standards Manager, 
Carol Blackston, requiring a "133 MHz or higher Pentium-compatible CPU" for Windows 2000 
Professional, a "166 MHz Pentium-compatible CPU or higher" for Office 2000 Premium, and a "75 
MHz Pentium-compatible CPU or higher" for Office 2000 Professional or Office 2000 Standard. 

vii. Microsoft requirements for a Microsoft Operations Manager Server, a Database Server, a 
Reporting Server, or a SQL Server 2000 Reporting Services Server, listed as a "PC with 550 MHz 
or higher Pentium-compatible;" an Administrator and Operator Console, listed as a "PC with 500 
MHz or higher Pentium-compatible;" and a Managed Computer, listed as a "PC with 200 MHz or 
higher Pentium-compatible." 

viii. An article by Taran Rampersad from the Free Software Consortium (FSC) dated March 
26, 2004, describing the basic system requirements of OpenOffice under Windows (98, NT, 2000, 
XP) including a "Pentium-compatible PC." 

ix. An article by Thomas Latuske posted June 8, 2004, describing two ways to retrieve the 
processor-speed and stating (in paragraph 1) that, "If you want to use the function to calculate the 
speed (frequency), you have to use it with a Pentium instruction set compatible processor." 

According to the above references (i, ii and iii), prior art disclosed the PENTIUM 
microprocessor instruction set, the instruction set was known to persons skilled in the art, 
and it was readily obtainable at the time the application was filed. As such, use of the 
trademark is not improper under 35 USC § 112. Leutzinger v. Ladd, 222 F. Supp. 681, 
682. Appellant concludes from the above references (i through ix) that as of the filing 
date in March of 1998, the phrase, "instructions of the PENTIUM microprocessor 
instruction set," had (and continues to have today) a fixed and definite meaning, and 
therefore, would apprise one skilled in the art of the respective scope of claim 17 and of 
claim 26. 

Appellant notes that MPEP 608.01(v), par. 6, states (emphasis supplied): 

"If the product to which the trademark refers is set forth in such language that its identity is clear, 
the examiners are authorized to permit the use of the trademark if it is distinguished from common 
descriptive nouns by capitalization. If the trademark has a fixed and definite meaning, it constitutes 
sufficient identification unless some physical or chemical characteristic of the article or material is 
involved in the invention." 

Therefore, when the trademark has a fixed and definite meaning, it constitutes 
sufficient identification in accordance with MPEP 608.01(v). 

The amount of detail required to be included in claims depends on the particular 
invention and the prior art, and is not to be viewed in the abstract but in conjunction with 
whether the specification is in compliance with the first paragraph of section 112. 
Chemcast Corp. v. Arco Industries Corp., 854 F.2d 1328 (Fed. Cir. 1988). 

The present specification discloses (p. 9, line 20 through p. 10, line 1) that: 

"In one embodiment of the invention, the processor 105 supports the Pentium® microprocessor 
instruction set and the packed data instruction set 145. By including the packed data instruction 
set 145 into a standard microprocessor instruction set, such as the Pentium® microprocessor 
instruction set, packed data instructions can be easily incorporated into existing software 
(previously written for the standard microprocessor instruction set)." 

The Court of Customs & Patent Appeals, considering the use of a trade name, 
"Pliolite," in a claim in conjunction with the first paragraph requirements of section 112, 
where (appellant respectfully takes note that) neither the composition of "Pliolite" or 
"Plioform" nor a method of preparing them, nor who manufactured or sold them, was 
disclosed in the original application, held that, "A fair interpretation of the facts herein 
leads to the conclusion that the Goodyear products 'Plioform' and 'Pliolite' were on the 
market and were known generally to those skilled in the art of chemistry at the time the 
original application was filed by appellants. With the information contained in the 
original application, therefore, it was possible for those skilled in the art at that time to 
practice appellants' invention, and thus it follows that the disclosure therein was 
sufficient." In re Gebauer-Fuelnegg, 50 USPQ 125 (C.C.P.A. 1941). 

Therefore, appellant respectfully submits that the specification has set forth a full 
and clear description of the claimed subject matter in sufficient detail to conclude that 
appellant had possession of the claimed invention, further to enable one skilled in the art 
to practice the claimed invention, and finally to apprise one skilled in the art of claims' 
respective scope. 

In addition, the Office Action (8.2) maintains that there are at least ten (10) 
different particular microprocessors produced by Intel Corporation which carry the 
trademark, PENTIUM, many of which contain different instruction sets. While appellant
disagrees that these particular microprocessors contain wholly different instruction sets,
appellant respectfully submits that a claim is not indefinite simply because it covers a
number of possible embodiments. 

The Court of Customs & Patent Appeals in considering the expression "organic 
and inorganic acids," which was alleged to be indefinite and of uncertain scope, held that, 
"Although there are undoubtedly a large number of acids which come within the scope of 
'organic and inorganic acids,' the expression is not for that reason indefinite." In re
Skoll, 187 USPQ 481 (CCPA 1975).

Appellant therefore submits that Claims 17 and 26 set out and circumscribe
subject matter with a sufficient degree of precision and particularity to apprise one of skill
in the art of each claim's respective scope.

Accordingly, in light of the arguments presented above, appellant respectfully
submits that claims 17 and 26-38 are not indefinite and are, therefore, in compliance with
35 USC § 112, second paragraph.

B. FIRST 35 U.S.C. § 103(a) REJECTIONS

Claims 16-20, 25, 26-38 and 39-42 stand rejected under 35 USC § 103(a) as 
allegedly being unpatentable over US Patent 5,859,789 (hereafter "Sidwell") in view of
Visual Instruction Set (VIS ™) User's Guide, Sun Microsystems, March 1997 (hereafter 
"Sun"). 

1. Claims 17 and 26-29 Are Not Obvious.

With regard to Claims 17 and 26-29, the Office Action of Aug. 19, 2003 (8 of 
paper no. 5) states that it would have been obvious to make the combined system of Sidwell
and Sun, which performs Sun's packed sum of absolute differences, compatible with the
PENTIUM® microprocessor instruction set. Appellant respectfully disagrees.

First, in determining the scope and content of the cited references with regard to 
the instant claims at issue, appellant respectfully submits that Sidwell is directed to an 
arithmetic unit for packed arithmetic. The arithmetic unit of Sidwell is comprised of a 
collection of separate packed arithmetic execution units, each responsible for some subset 
of packed arithmetic instructions (Fig. 2, col. 5, lines 15-19). Two source operands for 
the packed arithmetic units 70-80 are supplied along the Source 1 and Source 2 busses 52,
54 (col. 5, lines 26-27). Sidwell discloses that one separate packed arithmetic execution
unit (the multiply-add unit 76) is capable of executing a single instruction, the result of
executing that instruction being to multiply together respective pairs of objects from two
operands and to add together the results to provide a final result (col. 7, lines 21-31,
emphasis supplied). Sidwell discloses that another separate packed arithmetic execution
unit (the obvious packed arithmetic unit 80) performs the addition, subtraction,
comparison and multiplication of packed numbers (Figs. 4 and 5, col. 5, line 50 through
col. 6, line 47).

Sun is directed to a set of visual instructions that are used primarily to write
graphics and multimedia applications (p. 41, first paragraph). One of these instructions
(the vis_pdist() instruction) accumulates the absolute values of differences into a
destination accumulator (p. 87, last paragraph). Sun also shows that there is no
sum of absolute values without accumulation and therefore the accumulator must be
initialized to zero prior to beginning execution of the vis_pdist() instructions (p. 88, line
9, 4.7.11 Example). The vis_pdist() instruction of Sun has three source operands, one of
which is also a destination, and it is necessary for the accumulating register, accumulator,
to appear both as an argument and as the receiver of the return value (p. 88, first
paragraph).
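
The accumulating behavior attributed to Sun above can be illustrated with a minimal sketch. It is offered only as an illustration under stated assumptions: the function name pdist_accumulate, the representation of packed pixels as lists of integers, and the sample values are hypothetical and are not taken from Sun or from the present application.

```python
def pdist_accumulate(pixels_a, pixels_b, accum):
    """Hypothetical model of an accumulating pixel-distance operation.

    Returns accum plus the sum of |a - b| over corresponding pixel pairs.
    """
    if len(pixels_a) != len(pixels_b):
        raise ValueError("operands must hold the same number of pixels")
    return accum + sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))


# Illustrative 8-pixel rows (made-up values, not from any cited reference).
row1_a = [10, 20, 30, 40, 50, 60, 70, 80]
row1_b = [12, 18, 30, 45, 50, 55, 70, 90]
row2_a = [1, 1, 1, 1, 1, 1, 1, 1]
row2_b = [0, 0, 0, 0, 0, 0, 0, 0]

accum = 0  # the accumulator is zeroed before the first call
accum = pdist_accumulate(row1_a, row1_b, accum)  # accum is both argument and result
accum = pdist_accumulate(row2_a, row2_b, accum)  # second call adds to the prior total
```

In this sketch the accumulator appears both as an argument and as the receiver of the return value, and a separate initialization to zero is needed before the first call.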

Next, appellant respectfully points out some of the differences between the cited 
references and the instant claims at issue. Claim 17, for example, sets forth: 

17. (Previously Presented) The processor of Claim 16, wherein the decode unit further 
decodes a plurality of instructions of a PENTIUM microprocessor instruction 
set. 

And Claim 16 sets forth: 

16. (Previously Presented) A processor comprising:

a decode unit to decode a plurality of packed data instructions including a
packed sum of absolute differences (PSAD) instruction having a first format to
identify a first set of packed data, and a packed multiply-add (PMAD)
instruction having a second format to identify a second set of packed data, said 
decode unit to initiate a first set of operations on the first set of packed data 
responsive to decoding the PSAD instruction and to initiate a second set of 
operations on the second set of packed data responsive to decoding the PMAD 
instruction; and 

an execution unit to perform a first operation of the first set of operations 
initiated by the decode unit and to perform a second operation of the second set 
of operations initiated by the decode unit.

In the Office Action of Aug. 19, 2003 (7 and 8 of paper no. 5) the Examiner
asserts that it would have been obvious to combine Sun's packed sum of absolute
differences to Sidwell's system because Sidwell taught that the packed arithmetic unit
performed additional operations (col. 5, lines 15-22) and Sun taught that a packed sum of
absolute differences instruction was beneficial in accelerating motion compensation to
support real-time video compression (p. 88) and then to make the combined system of
Sidwell and Sun, perform Sun's vis_pdist() instruction, compatible with the PENTIUM®
microprocessor instruction set because the PENTIUM microprocessor instruction set is 
the most widely used microprocessor instruction set in the world. 

The multiply-add instruction of Sidwell (muladd2ps) employs three operands (e.g.
see Sidwell, col. 8, lines 24 and 34-37). The vis_pdist() instruction of Sun also requires
three operands, one of which is both a source and a destination (e.g. see Sun, p. 88, line
10). Therefore, one difference between the claimed decoder of instructions of the 
PENTIUM microprocessor instruction set and the expected properties of the combined 
system of Sidwell and Sun is the absence of an expected third operand. 

Appellant respectfully submits that the PENTIUM microprocessor
instruction set has a well-known opcode format, which permits two operands, one of the
operands acting both as a source operand and a destination operand. To decode and
execute the vis_pdist() instruction of Sun, having three source operands, in a processor for
executing two-operand instructions of the PENTIUM microprocessor instruction set is
not suggested by either of the cited references.

Further, since the vis_pdist() instruction of Sun explicitly requires at least three
sources (two sets of pixels and one accumulator) the use of a two-operand format would
be unexpected. Yet, because the instructions of the PENTIUM microprocessor
instruction set require only two sources, one of which is also a destination, the data path
for the packed sum of absolute differences may be 75% as wide as one requiring three 
sources and the number of read ports required in the register file may be 66% as many as 
would otherwise be required for reading three source registers. Such reductions are 
statistically significant. Moreover, in order for a decoder of the PENTIUM 
microprocessor instruction set to be adapted to also decode a new three-operand 
instruction, significant modifications and increased design complexity would be required, 
most probably introducing additional delays to critical timing paths for high frequency 
designs. Such considerations are also of great practical significance in the field of 
microprocessor design. 
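
The difference in source-operand count discussed above can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any actual hardware or instruction encoding: the register names, the dictionary used as a register file, and both function names are hypothetical, and the sketch assumes that the two-operand form overwrites its destination with the result.

```python
def psad_two_operand(regs, dst, src):
    """Hypothetical two-operand form: dst is read as a source, then overwritten."""
    reads = [dst, src]
    regs[dst] = sum(abs(a - b) for a, b in zip(regs[dst], regs[src]))
    return reads


def pdist_three_source(regs, rd, rs1, rs2):
    """Hypothetical three-source form: rd is read as the accumulator, then overwritten."""
    reads = [rd, rs1, rs2]
    regs[rd] = regs[rd] + sum(abs(a - b) for a, b in zip(regs[rs1], regs[rs2]))
    return reads


# Made-up register contents: two packed registers and one scalar accumulator.
regs = {"r0": [10, 20, 30, 40], "r1": [12, 18, 30, 45], "r2": 0}

three_reads = pdist_three_source(regs, "r2", "r0", "r1")  # reads r2, r0, r1
two_reads = psad_two_operand(regs, "r0", "r1")            # reads r0, r1 only

# Two register reads instead of three, i.e. two thirds as many.
read_ratio = len(two_reads) / len(three_reads)
```

Under these assumptions the two-operand form reads two registers where the three-source form reads three, which corresponds to the roughly 66% figure for read ports stated above.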

The Final Office Action of April 9, 2004 (7.3 of paper no. 8) states that there are 
many PENTIUM instructions that while having only two programmer specified operands, 
make use of one or more additional implicit operands, and so, as a result that instruction 
is effectively a three or more operand instruction, citing as an example the IMUL 
instruction. 

The Office Action of Jan. 10, 2005 (8.5) maintains the assertion and provides, as
extrinsic evidence for common knowledge in the art of the PENTIUM microprocessor
instruction set, a reference, "Pentium® Processor Family Developer's Manual, Vol. 3:
Architecture and Programming Manual," 1995, pp. 25-165 and 25-166, showing IMUL
instructions that use implicit operands.

Appellant respectfully notes that the three IMUL instructions listed that use
implicit operands use only one programmer specified operand and an implicit destination
register where the lower half of the destination register is used as a source (p. 25-165,
lines 3-5). Thus, the IMUL instructions with implicit operands are effectively, still, two-
operand instructions.

Therefore, appellant respectfully submits that since the PENTIUM microprocessor 
instruction set has an opcode format that permits two operands, one of those operands 
acting both as a source operand and a destination operand; and since the vis_pdist() 
instruction of Sun requires three source operands, one of which is also a destination (p. 
87, 4.7.11 Syntax and Description); a sum of absolute differences instruction with a two-
operand opcode format would not perform the operation defined by the vis_pdist()
instruction of Sun without modification. No suggestion of such modification is provided 
either by the cited references or by a common knowledge of IMUL in one of ordinary 
skill in the art. 

For example, Sun discloses that in the vis_pdist() instruction, one source is also
an accumulating destination register. Therefore, the vis_pdist() instruction of Sun
computes an accumulation of current and prior absolute differences rather than a sum of
the absolute differences on a first identified set of packed data as set forth in the instant 
claims at issue. There is no accumulation of prior absolute differences in what appellant 
has done. Thus, appellant submits that an absence in the claimed invention of the 
expected accumulation of prior absolute differences from the combined system of Sidwell 
and Sun is evidence of nonobviousness. 

Additionally, Sun discloses that in the vis_pdist() instruction, "it is necessary for
the accumulating register to appear both as an argument and as the receiver of the return
value" (p. 88, first paragraph, emphasis supplied). Thus, Sun teaches away from an
implicit source that is also the destination register, which is precisely the technique
employed in the implicit-operand IMUL instructions of the PENTIUM microprocessor
instruction set. Therefore, it would not be obvious to combine the vis_pdist() instruction 
of Sun with the implicit operand technique used in the IMUL instructions, when Sun, 
itself, teaches away from such a technique. 

Further, while the Office Action (8.5) maintains that it would have been obvious 
to use implicit operands, as in the IMUL instruction of the PENTIUM microprocessor 
instruction set, for the combined system of Sidwell and Sun to perform Sun's packed sum 
of absolute differences, appellant respectfully submits that the packed sum of absolute
differences (PSAD) instruction of the present application does not have an implicit
operand. Thus, appellant submits that the absence of an expected implicit operand is also
evidence of nonobviousness. 

The claims set forth a packed sum of absolute differences (PSAD) instruction 
having a first format to identify a first set of packed data, initiating a first set of 
operations on the first set of packed data responsive to decoding the PSAD instruction, 
and execution logic to perform a first operation of the first set of operations. 

Since Sun teaches away from an implicit source that is also the destination 
register, the alleged two-operand combination of Sidwell and Sun would necessarily fail 
to identify some of the first set of packed data, and instead that source of packed data 
would necessarily have to be the implicit operand. 

The packed sum of absolute differences instruction of the present application has 
a format to identify the first set of packed data on which to perform the packed sum of
absolute differences. There is no implicit source operand in what appellant has done.

Appellant respectfully submits that the alleged combination of references should
not be considered obvious if it does not fairly disclose doing what appellant has done.

The general rule applicable to a rejection based on a combination of references
was stated in In re Schaffer, 108 USPQ 326, 328-29 (1956):

[References] may be combined for the purpose of showing that a claim is unpatentable. However,
they may not be combined indiscriminately, and to determine whether the combination of
references is proper, the following criterion is often used: namely, whether the prior art suggests
doing what an applicant has done. ... [It] is not enough for a valid rejection to view the prior art in
retrospect once an applicant's disclosure is known. The art applied should be viewed by itself to
see if it fairly disclosed doing what an applicant has done.

Finally, Sidwell's system provides no path for an accumulator input to packed
arithmetic unit 6, either from result bus 56 or as a third source operand to packed
arithmetic unit 6 (Figs. 1, 2, 4, and 6; col. 5, line 15 through col. 7, line 53). Therefore,
Sidwell's system could not perform Sun's packed sum of absolute differences without
significant modifications to permit a third source operand for packed arithmetic 
instructions. Appellant respectfully submits that no suggestion for such modifications is 
provided by Sidwell; and even if the modifications were made to Sidwell to permit a third 
source operand for packed arithmetic instructions, it would be nonobvious, without the 
aid of appellant's disclosure to view the prior art in retrospect, to perform an operation 
requiring such a third source operand but using only a two-operand opcode format as in 
the PENTIUM microprocessor instruction set. 

Therefore, appellant respectfully submits that without viewing the prior art in 
retrospect with the aid of appellant's disclosure, no suggestion is provided by Sidwell or 
Sun for doing what appellant has done. 

Accordingly, in light of the above arguments, Claims 17 and 26-29 are not
obvious in view of the cited references.

2. Claims 18, 30 and 39-42 Are Not Obvious.

With regard to Claims 18, 30 and 39-42, the Office Action mailed Aug. 19, 2003 
(9 of paper 5) states that Sun taught performing a packed subtract and write carry, a
packed absolute value and a packed add horizontal. Appellant respectfully disagrees and
again notes that Claims 18, 30 and 39-42 set forth a packed subtract and write carry
operation, a packed absolute value and read carry operation, and a packed add horizontal 
operation (emphasis added). 

Claim 39, for example, sets forth: 

39. (Previously Presented) A processor comprising: 

decode logic to decode a packed sum of absolute differences (PSAD) 
instruction having a first format to identify a first set of packed data, said decode 
logic to initiate a first set of operations on the first set of packed data responsive 
to decoding the PSAD instruction, the first set of operations comprising: 
a packed subtract and write carry (PSUBWC) operation; 
a packed absolute value and read carry (PABSRC) operation; and
a packed add horizontal (PADDH) operation; and
execution logic to perform the first set of operations initiated by the decode 
logic. 

In determining the scope and content of the cited references with regard to the 
instant claims at issue, appellant respectfully submits that Sun is directed to a set of visual 
instructions used primarily to write graphics and multimedia applications (p. 41, first 
paragraph). One of these instructions (the vis_pdist() instruction) accumulates the
absolute values of differences into a destination accumulator (p. 87, last paragraph). Sun
states that "the pixels are subtracted from one another, pair wise, and the absolute values 
of the differences are accumulated into accum" (p.87, 4.7.11, Description, first 
paragraph). Sun does not teach a packed subtract and write carry operation or a packed 
absolute value and read carry operation, as set forth in claims 18, 30 and 39.

Sidwell is directed to an arithmetic unit for packed arithmetic. The arithmetic unit
6 of Sidwell is comprised of a collection of separate packed arithmetic execution units,
each responsible for some subset of packed arithmetic instructions (Fig. 2, col. 5, lines
15-19). Sidwell discloses that the multiply-add unit 76 is capable of executing a single
instruction, the result of executing that instruction being to multiply together respective
pairs of objects from two operands and to add together the results to provide a final result
(col. 7, lines 21-31, emphasis supplied). Sidwell discloses that the obvious packed
arithmetic unit 80 performs the addition, subtraction, comparison and multiplication of
packed numbers (Figs. 4 and 5, col. 5, line 50 through col. 6, line 47). Sidwell states that
"The execution units 2, 4, 6 do not hold any state between instructions. Thus subsequent
instructions are independent." (col. 4, lines 36-38).

Therefore, Sidwell teaches away from an execution unit to hold carry state 
between instructions, which is what appellant has done in the packed subtract and write 
carry operation, and the packed absolute value and read carry operation as set forth by
claims 18, 30 and 39. 

Appellant respectfully points out some of the differences between the cited 
references and the instant claims at issue. Sidwell does not disclose a packed sum of 
absolute differences instruction. Sun discloses that in the vis_pdist() instruction, one
source is also an accumulating destination register. Therefore, in the combined system of
Sidwell and Sun, the vis_pdist() instruction would be expected to compute an
accumulation of current and prior absolute differences rather than a sum of the absolute
differences on a first identified set of packed data as set forth in the instant claims at
issue. There is no accumulation of prior absolute differences in the packed sum of
absolute differences of the instant claims at issue. Thus, appellant submits that an
absence in the claimed invention of the expected accumulation of prior absolute 
differences from the combined system of Sidwell and Sun would be unexpected. 

Because the packed sum of absolute differences instruction does not require an
accumulator source, the data path for the packed sum of absolute differences may be 75% 
as wide as one requiring a third source and the number of read ports required in the 
register file may be 66% as many as would otherwise be required for reading a third 
source register. Such reductions are statistically significant. Of practical significance is 
that since there is no version of the vis_pdist() instruction that does not use an 
accumulation of prior absolute differences, an additional instruction, vis_fzero(), is 
required to initialize the accumulator before the vis_pdist() instruction can be used (e.g.
see Sun, p. 88, lines 9-10).

Additionally, neither Sidwell nor Sun discloses a packed subtract and write carry
operation or a packed absolute value and read carry operation, as set forth in claims 18,
30 and 39. Neither cited reference discusses or suggests the writing of any carry state as
part of a packed subtraction operation or the reading of any carry state as part of a packed
absolute value operation. In fact, Sidwell teaches away from an execution unit to hold
carry state between instructions. Therefore, the presence of such operations in the
combined system of Sidwell and Sun would be unexpected. 

The Office Action mailed Jan. 10, 2005 (8.6) states that the fact that Sidwell
teaches that the execution units do not hold state is immaterial because no execution unit
holds state in any system. What Sidwell teaches is (col. 4, lines 36-38, emphasis
supplied) that, "The execution units 2, 4, 6 do not hold any state between instructions.
Thus subsequent instructions are independent." Therefore, appellant respectfully submits
that the presence of such unexpected operations to hold carry state between subsequent
instructions is evidence of nonobviousness. 

Because the packed subtract and write carry operation and the packed absolute 
value and read carry operation employ an execution unit to hold carry state between
subsequent instructions rather than duplicating the adder/subtractor circuitry, the
execution circuitry for performing the packed subtract and write carry operation and the
packed absolute value and read carry operation may be 50% of the execution circuitry for
performing an absolute difference operation without an execution unit to hold carry state
between instructions. Such a reduction is statistically significant. The present
specification, for example, discloses (Figs. 9 and 10, p. 19, lines 1-8) that:

"In one embodiment, the PSUBWC/PABSRC arithmetic element 900 is the same circuitry used to
perform the PADD instruction 151. The mux 920 is added and the 0*^0 bus is routed to the
register 940 and the Cbipm.0 bus routed to the mux 920 to provide for the PSAD instruction 160.

By saving the carry bits from the PSUBWC operation and using the saved carry bits to control the
subsequent PABSRC operation, the same circuitry used to perform the PADD hardware may be 
used to perform both the PSUBWC and the PABSRC operations with relatively little additional 
circuitry." 

Therefore, through a modification of the PADD hardware to support saving the 
carry bits from one operation and using the saved carry bits to control a subsequent 
operation, the same circuitry may be reused and no new circuitry to perform an absolute 
difference operation is required, which has substantial practical significance.
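
The three-operation sequence described above can be sketched in software as follows. This is a minimal illustration under stated assumptions and not a description of the actual circuitry: it assumes 8-bit unsigned lanes, it assumes that the saved "carry" for each lane is the borrow produced when the subtraction wraps around, and the lower-case function names are hypothetical stand-ins for the PSUBWC, PABSRC and PADDH operations.

```python
LANE_MASK = 0xFF  # assumption: each packed element is an 8-bit unsigned lane


def psubwc(a, b):
    """Packed subtract; also records one borrow bit per lane (the 'write carry')."""
    diffs = [(x - y) & LANE_MASK for x, y in zip(a, b)]
    borrows = [x < y for x, y in zip(a, b)]
    return diffs, borrows


def pabsrc(diffs, borrows):
    """Packed absolute value controlled by the saved borrow bits (the 'read carry')."""
    # A lane that borrowed holds the wrapped negative difference, so negate it.
    return [(-d) & LANE_MASK if borrow else d for d, borrow in zip(diffs, borrows)]


def paddh(lanes):
    """Packed add horizontal: sum all lanes into a single result."""
    return sum(lanes)


def psad(a, b):
    """Sum of absolute differences built from the three operations above."""
    diffs, borrows = psubwc(a, b)
    return paddh(pabsrc(diffs, borrows))


# Made-up 8-lane operands for illustration.
src_a = [10, 20, 30, 40, 50, 60, 70, 80]
src_b = [12, 18, 30, 45, 50, 55, 70, 90]
result = psad(src_a, src_b)
```

In this sketch a single subtraction per lane is performed, and the saved borrow bits decide whether each lane is negated, so no second subtraction in the opposite direction is needed.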

The Final Office Action of April 9, 2004 (7.4 of paper no. 8) states that a packed 
subtract and write carry operation and a packed absolute value and read carry operation
are inherently present in Sun's system. Appellant respectfully disagrees. 

Two commonly used techniques for computing an absolute difference, found in
the design of floating point mantissa arithmetic, are: (1) compare two numbers and reorder
to subtract the smaller from the larger, or (2) subtract the two numbers in both directions
and select the positive result. Appellant respectfully submits that one of these two
alternatives may reasonably be expected to be inherently present in Sun's system rather
than the packed subtract and write carry operation and the packed absolute value and read
carry operation set forth by the present application.
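
The two conventional alternatives named above can be sketched for a single pair of values as follows. The function names are hypothetical and the sketch is illustrative only; it is not drawn from Sun, Sidwell or the present application.

```python
def absdiff_compare_reorder(a, b):
    """Alternative (1): compare, reorder, then subtract the smaller from the larger."""
    larger, smaller = (a, b) if a >= b else (b, a)
    return larger - smaller


def absdiff_both_directions(a, b):
    """Alternative (2): subtract in both directions and select the non-negative result."""
    d_ab = a - b  # first subtractor
    d_ba = b - a  # second subtractor, operating in the opposite direction
    return d_ab if d_ab >= 0 else d_ba
```

Alternative (2) performs two subtractions for every pair of values, which is the duplication of adder/subtractor circuitry referred to in the surrounding argument.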

For example, appellant respectfully points out that US Patent 5,938,756
(hereafter "Van Hook") shows the vis_pdist() circuit illustrated below:

[Figure: Van Hook, Fig. 9b, pixel distance computation circuit]

The disclosure of Van Hook (Figs. 9a-9b, col. 10, line 53 through col. 11, line 4,
emphasis supplied) also states:

"Referring now to FIGS. 9a-9b, the pixel distance computation instructions, and the pixel distance 
computation circuit are illustrated. As shown in FIG. 9a, there is one graphics data distance
computation instruction 133 for simultaneously accumulating the absolute differences between
graphics data, eight pairs at a time. The PDIST instruction 138 subtracts eight 8-bit graphics data
in the rs1 register from eight corresponding 8-bit graphics data in the rs2 register. The sum of the
absolute values of the differences is added to the content of the rd register. The PDIST instruction
is typically used for motion estimation in video compression algorithms.

As shown in FIG. 9b, in this embodiment, the pixel distance computation circuit 36 comprises
eight pairs of 8 bit subtracters 57a-57h. Additionally, the pixel distance computation circuit 56
further comprises three 4:2 carry save adders 61a-61c, a 3:2 carry save adder 62, two registers 63a-
63b, and a 11-bit carry propagate adder 65, coupled to each other as shown."

Therefore, appellant submits that alternative (2), subtracting the two numbers in 
both directions and selecting the positive result, is what is likely to be inherent in Sun. 
As appellant indicates above, that alternative requires new circuitry to perform an 
absolute difference operation, which has twice as many adder/subtractors (e.g. see Sun, 
Fig. 9b, 57a-h) as the execution circuitry for performing a packed subtract and write carry 
operation and a packed absolute value and read carry operation (e.g. see Fig. 10, 1000-
1070 of the present application).

Thus, the presence of a packed subtract and write carry operation and a packed
absolute value and read carry operation in the combined system of Sidwell and Sun would 
be nonobvious. 

Additionally, the instant claims at issue set forth decode logic to initiate a set of 
operations responsive to decoding the PSAD instruction, the operations comprising: a 
packed subtract and write carry (PSUBWC) operation; a packed absolute value and read 
carry (PABSRC) operation; and a packed add horizontal (PADDH) operation. For
example, in the present application (Fig. 1, p. 9, lines 8-10) it states: 

"The decode unit 140 is used for decoding instructions received by the processor 105 into control
signals and/or microcode entry points."

and further (Fig. 5, p. 13, lines 5-21) states that:

"In step 500, the first operation is a packed subtract and write carry (PSUBWC) operation. ... In
step 510, the second operation is a packed absolute value and read carry (PABSRC) operation.
... In step 520, the third operation is a packed add horizontal (PADDH) operation."

Decoding instructions into such sets of control signals and/or microcode entry 
points is not disclosed by Sidwell or by Sun. Appellant respectfully submits that a
presence in the claimed invention of decode logic to initiate the packed subtract and write
carry (PSUBWC), the packed absolute value and read carry (PABSRC), and the packed
add horizontal (PADDH) operations responsive to decoding the PSAD instruction, which
are not disclosed by the combined references of Sidwell and Sun, would be unexpected
without viewing the prior art in retrospect with the aid of appellant's disclosure.

Appellant respectfully submits that Van Hook is relevant to what would be 
expected from the combined system of Sidwell and Sun to execute the vis_pdist()
instruction. To that end, appellant respectfully points out that in Fig. 9b of Van Hook
shown above, a pixel distance computation circuit of the second partitioned execution
path is designed to perform the entire vis_pdist() instruction rather than a microcode
operation for a portion of the vis_pdist() instruction.

Because the PSAD instruction is decoded into a set of microcode control signals, 
66% fewer instructions may need to be fetched and decoded for computing a packed sum of
absolute differences, which is statistically significant. Of practical significance is that
not all (if any) of the operations PSUBWC, PABSRC and PADDH may actually need to
be supported outside of the microcode by opcodes for user programmable instructions.
Yet, reuse of the PADD hardware for PSUBWC and PABSRC, and of the PMAD
hardware for PADDH may be accomplished through use of a microcode sequence.
Neither Sidwell nor Sun discloses decoding the vis_pdist() instruction into a set of
microcode operations or the reuse of packed adder hardware or packed multiply-add
hardware for computing a packed sum of absolute differences. 

Therefore, appellant respectfully submits that without viewing the prior art in 
retrospect with the aid of appellant's disclosure, no suggestion is provided by Sidwell or
Sun for doing what appellant has done.

3. Claims 16 and 36 Are Not Obvious.

With regard to Claims 16 and 35, the Office Action states that where 
modifications would be required in the combined references, appellant is in error to 
suggest that the references must disclose or suggest such modifications, and that the 
measure is what the teachings would suggest to one of ordinary skill in the art. Appellant 
respectfully disagrees with the Examiner's characterization of what Sidwell and Sun 
taught or would suggest to one of ordinary skill in the art.

Once again, in determining the scope and content of the cited references with 
regard to the instant claims at issue, appellant respectfully submits that Sidwell is directed 
to an arithmetic unit for packed arithmetic. The arithmetic unit 6 of Sidwell is comprised 
of a collection of separate packed arithmetic execution units, each responsible for some 
subset of packed arithmetic instructions (Fig. 2, col. 5, lines 15-19). Sidwell discloses 
that the obvious packed arithmetic unit 80 performs the addition, subtraction, comparison
and multiplication of packed numbers (Figs. 4 and 5, col. 5, line 50 through col. 6, line
47). Sidwell discloses that the multiply-add unit 76 is capable of executing a single
instruction, the result of executing that instruction being to multiply together respective
pairs of objects from two operands and to add together the results to provide a final result
(col. 7, lines 21-31, emphasis supplied). Sidwell does not disclose or suggest reuse of the
multiply-add unit 76 even for performing packed multiply instructions (mul2ps). Nor 
does Sidwell disclose or suggest cooperation between the various collection of separate 
packed arithmetic units to perform any of the packed instructions. 

Sun is directed to a set of visual instructions used primarily to write graphics and
multimedia applications (p. 41, first paragraph). One of these instructions (the vis_pdist()
instruction) accumulates the absolute values of differences into a destination accumulator
(p. 87, last paragraph). Sun states that "the pixels are subtracted from one another, pair
wise, and the absolute values of the differences are accumulated into accum" (p. 87,
4.7.11, Description, first paragraph). Sun does not disclose or suggest a version of the
vis_pdist() instruction that does not use an accumulation of prior absolute differences.

Appellant now respectfully points out some of the differences between the cited
references and the instant claims at issue. Claim 16, for example, sets forth:

16. (Previously Presented) A processor comprising: 

a decode unit to decode a plurality of packed data instructions including a 
packed sum of absolute differences (PSAD) instruction having a first format to
identify a first set of packed data, and a packed multiply-add (PMAD)
instruction having a second format to identify a second set of packed data, said
decode unit to initiate a first set of operations on the first set of packed data
responsive to decoding the PSAD instruction and to initiate a second set of
operations on the second set of packed data responsive to decoding the PMAD
instruction; and

an execution unit to perform a first operation of the first set of operations
initiated by the decode unit and to perform a second operation of the second set
of operations initiated by the decode unit.

Sidwell does not disclose a packed sum of absolute differences instruction. Sun 
discloses that in the vis_pdist() instruction, one source is also an accumulating destination 
register. Therefore, in the combined system of Sidwell and Sun, the vis_pdist() 
instruction would be expected to compute an accumulation of current and prior absolute 
differences rather than a sum of the absolute differences on the first identified set of 
packed data as set forth in the instant claims at issue. There is no accumulation of prior 
absolute differences in the packed sum of absolute differences of the instant claims at 
issue. Thus, appellant submits that an absence in the claimed invention of the expected
accumulation of prior absolute differences from the combined system of Sidwell and Sun
would be unexpected.

For example, Sun shows a diagram of a floating-point & graphics unit to perform
the vis_pdist() instruction (Fig. 2-4, p. 11) as follows:

[Figure 2-4: Floating Point and Graphics Unit]

Sun discloses that the graphics adder (GR+) and graphics multiplier (GR*) 
perform the graphics operations of the VIS instruction set (p. 11, lines 10-11).

Appellant again refers to the disclosure of US Patent 5,938,756, Van Hook. Like 
Sun, Van Hook shows a diagram of a graphics execution unit (GRU) having first 
(corresponding to GR+) and second (corresponding to GR*) independent execution paths 
to independently execute graphics instructions (Fig. 2, col. 4, lines 3-10) as follows:

[Figure 2]

Van Hook discloses that the second partitioned execution path comprises a pixel 
distance computation circuit (Fig. 5, 56, col. 5, lines 4-5).

Sidwell shows a diagram of an arithmetic unit comprised of a collection of 
separate packed arithmetic execution units, each responsible for some subset of packed 
arithmetic instructions (Fig. 2, col. 5, lines 15-19) as follows:

Like Sidwell, Van Hook shows a diagram of his first partitioned execution path
similarly comprised of a collection of separate packed arithmetic execution units, each
responsible for some subset of packed arithmetic instructions (Fig. 4, col. 4, lines 26-46)
as follows:




For example, the obvious packed arithmetic unit 80 of Sidwell may perform some subset 
of packed arithmetic instructions similar to those performed by the partitioned carry adder 
37 of Van Hook (e.g. see Van Hook, col. 4, lines 47-54). Therefore, appellant 
respectfully submits that Van Hook is relevant to what would be expected from the
combined system of Sidwell and Sun, with regard to an included vis_pdist() instruction. 

To that end, appellant respectfully points out that Van Hook shows a diagram of
the pixel distance computation circuit of the second partitioned execution path to perform
the vis_pdist() instruction, as illustrated below:







[Figure 9b]




In the disclosure of Van Hook (Figs. 9a-9b, col. 10, line 53 through col. 11, line 4,
emphasis supplied) it also states:

"Referring now to FIGS. 9a-9b, the pixel distance computation instructions, and the pixel distance
computation circuit are illustrated. As shown in FIG. 9a, there is one graphics data distance
computation instruction 138 for simultaneously accumulating the absolute differences between
graphics data, eight pairs at a time. The PDIST instruction 138 subtracts eight 8-bit graphics data
in the rs1 register from eight corresponding 8-bit graphics data in the rs2 register. The sum of the
absolute values of the differences is added to the content of the rd register. The PDIST instruction
is typically used for motion estimation in video compression algorithms.
As shown in FIG. 9b, in this embodiment, the pixel distance computation circuit 56 comprises
eight pairs of 8 bit subtractors 57a-57h. Additionally, the pixel distance computation circuit 56
further comprises three 4:2 carry save adders 61a-61c, a 3:2 carry save adder 62, two registers
63a-63b, and a 11-bit carry propagate adder 65, coupled to each other as shown."
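
By way of illustration only, the difference between an accumulating pixel distance
operation and a non-accumulating sum of absolute differences may be sketched in Python
as follows. The function names pdist and psad, and the use of Python lists as operands,
are assumptions made for this sketch; it models arithmetic behavior only and does not
reproduce the circuitry of Van Hook or of the present application.

```python
def abs_diffs(src1, src2):
    """Absolute differences of eight corresponding unsigned 8-bit elements."""
    src1, src2 = list(src1), list(src2)
    assert len(src1) == len(src2) == 8
    assert all(0 <= x <= 255 for x in src1 + src2)
    return [abs(a - b) for a, b in zip(src1, src2)]


def pdist(src1, src2, accum):
    """Accumulating form (hypothetical model): the prior accumulator
    value acts as a third source and is added to the new sum."""
    return accum + sum(abs_diffs(src1, src2))


def psad(src1, src2):
    """Non-accumulating form (hypothetical model): two sources, one result."""
    return sum(abs_diffs(src1, src2))


a = [10, 20, 30, 40, 50, 60, 70, 80]
b = [12, 18, 33, 40, 45, 66, 70, 90]

print(psad(a, b))            # sum of the eight absolute differences
print(pdist(a, b, 100))      # same sum added to a prior accumulator value
print((8 * 255).bit_length())  # bits needed for the largest possible sum
```

In this model the accumulating form yields the same value as the non-accumulating form
only when the accumulator has first been set to zero, and the largest possible sum of
eight 8-bit absolute differences (2040) fits in 11 bits.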

Therefore, appellant submits that an absence in the claimed invention of the 
expected accumulation of prior absolute differences from the combined system of Sidwell 
and Sun would be unexpected. 

Because the packed sum of absolute differences instruction does not require an
accumulator source, the data path for the packed sum of absolute differences (requiring
only two sources and one destination) may be 75% as wide as one also requiring a third
source, which is statistically significant. Of practical significance is that the number of
computational stages required would be five for computing just a sum of absolute
differences (e.g. eliminating 3:2 carry save adder 62 of Van Hook) instead of six as
shown by Van Hook, which could impact the maximum design frequency. Also, since
there is no version of the vis_pdist() instruction that does not use an accumulation of prior
absolute differences, an additional instruction, vis_fzero(), is required by Sun to initialize
the accumulator before the vis_pdist() instruction can be used (e.g. see Sun, p. 88, lines
9-10).

Thus, the absence of the expected accumulation of prior absolute differences from 
the combined system of Sidwell and Sun would be nonobvious. 

Claim 16 also sets forth an execution unit to perform a first operation of the first set
of operations initiated by the decode unit responsive to decoding the PSAD instruction
and to perform a second operation of the second set of operations initiated by the decode 
unit responsive to decoding the PMAD instruction. Since Sun does not disclose a 
multiply-add instruction and since Sidwell does not disclose a sum of absolute 
differences, neither of the references discloses or suggests an execution unit to perform an 
operation initiated by the decode unit responsive to decoding the PSAD instruction and 
an operation initiated by the decode unit responsive to decoding the PMAD instruction. 

Sidwell admits that the multiply-add unit is capable of executing a single
instruction, the result of executing that instruction being to multiply together respective
pairs of objects from two operands and to add together the results to provide a final result
(col. 7, lines 21-31, emphasis supplied).

Since Sidwell teaches that the multiply-add execution unit 76 could perform only
the operations initiated by the decode unit responsive to decoding a packed multiply-add
instruction, Sidwell teaches away from the multiply-add execution unit 76 performing a
first operation initiated by the decode unit responsive to decoding the PSAD instruction.

Conspicuously, Sidwell does not even share the multiplier functionality of the 
multiply-add unit 76 (used for muladd2ps) with the multiply instruction, mul2ps, of the
obvious packed arithmetic unit 80 (col. 5, lines 37-43, col. 6, lines 19-22). Appellant 
respectfully submits that the alleged combination of references should not be considered 
obvious when Sidwell teaches away from precisely what appellant has done. 

As cited above, the general rule applicable to a rejection based on a combination
of references was stated in Schaffer, 108 USPQ at 328-329:

[References] may be combined for the purpose of showing that a claim is unpatentable. However,
they may not be combined indiscriminately, and to determine whether the combination of
references is proper, the following criterion is often used: namely, whether the prior art suggests
doing what an applicant has done. ... [It] is not enough for a valid rejection to view the prior art in
retrospect once an applicant's disclosure is known. The art applied should be viewed by itself to
see if it fairly disclosed doing what an applicant has done.

Moreover, without viewing the prior art in retrospect with the aid of appellant's 
disclosure, there is no suggestion in the cited references of an execution unit, as set forth 
by the instant claims at issue, to perform an operation initiated responsive to decoding the 
PSAD instruction and also an operation initiated responsive to decoding the PMAD 
instruction. Thus, appellant also submits that the presence in the claimed invention of an 
execution unit to perform operations initiated responsive to decoding both the PSAD 
instruction and also the PMAD instruction would be unexpected from the combined 
system of Sidwell and Sun. 

Referring once again to Van Hook, appellant respectfully submits that if the pixel
distance computation circuit illustrated in Figure 9b could reasonably be expected to
combine with Sidwell, it could not reasonably be expected to perform the multiply-add
instruction of Sidwell, having no multipliers and the capacity of only 11 bits for full
additions (Van Hook, Fig. 9b, 65, col. 11, line 3). Also, like Sidwell, Van Hook
conspicuously includes only one instruction (PDIST) in the instruction group 206, for
pixel distance unit 56 (Fig. 5, 56, col. 5, lines 16-17, Fig. 6c, 206).

Nor could Sidwell's multiply-add unit 76 reasonably be expected to perform
Sun's vis_pdist() instruction. Sidwell's system provides no path for an accumulator input
to packed arithmetic unit 6, for example, from result bus 56 or as a third source operand
to packed multiply-add unit 76 (Figs. 1, 2, 4, and 6; col. 5, line 15 through col. 7, line 53).
And as stated above, Sidwell does not even share the multiplier functionality of the
multiply-add unit 76 (used for muladd2ps) with the multiply instruction, mul2ps, of the
obvious packed arithmetic unit 80 (col. 5, lines 37-43, col. 6, lines 19-22).

Without viewing the prior art in retrospect with the aid of appellant's disclosure, 
the combined system of Sidwell and Sun could not reasonably be expected to include an 
execution unit to perform operations initiated responsive to decoding both the PSAD 
instruction and also the PMAD instruction. Therefore, appellant respectfully submits that 
no suggestion is provided by Sidwell or Sun for doing what appellant has done. 

Because the execution unit used to perform the PMAD instruction can also
perform the PADDH operation responsive to decoding the PSAD instruction, the
utility of the PMAD execution unit may be increased by 100%, which is statistically
significant. Of practical significance is that through relatively minor modifications to the
existing execution unit a sum of absolute differences instruction can be supported with
only a negligible increase in circuit area.

Thus, the presence of an execution unit to perform operations initiated responsive
to decoding both the PSAD instruction and also the PMAD instruction in the combined 
system of Sidwell and Sun would also be nonobvious. 

Accordingly, in light of the above arguments, Claims 16 and 36 are not obvious in
view of the cited references.

C. SECOND 35 U.S.C. § 103(a) REJECTIONS

Claims 21-24, 33-34 and 43-44 stand rejected under 35 USC § 103(a) as allegedly
being unpatentable over US Patent 5,859,789 (Sidwell) in view of Visual
Instruction Set (VIS™) User's Guide, Sun Microsystems, March 1997 (Sun) and further
in view of US Patent 5,721,697 (Lee).

1. Claims 21-22, 33-34 and 43-44 Are Not Obvious.

With regard to Claims 21, 33 and 43, the Office Action states that it would have 
been obvious to combine Lee into a system of Sidwell and Sun to produce the features of
the claims. Appellant respectfully disagrees. 

In determining the scope and content of the cited references with regard to the
instant claims at issue, appellant respectfully submits that Sidwell is directed to an
arithmetic unit for packed arithmetic. The arithmetic unit 6 of Sidwell is comprised of a
collection of separate packed arithmetic execution units, each responsible for some subset
of packed arithmetic instructions (Fig. 2, col. 5, lines 15-19). Sidwell discloses that the
obvious packed arithmetic unit 80 performs the addition, subtraction, comparison and
multiplication of packed numbers (Figs. 4 and 5, col. 5, line 50 through col. 6, line 47).
Sidwell discloses that the multiply-add unit 76 is capable of executing a single
instruction, the result of executing that instruction being to multiply together respective
pairs of objects from two operands and to add together the results to provide a final result
(col. 7, lines 21-31, emphasis supplied). Sidwell does not disclose or suggest reuse of the
multiply-add unit 76 even for performing packed multiply instructions (mul2ps).

Sun is directed to a set of visual instructions used primarily to write graphics and
multimedia applications (p. 41, first paragraph). One of these instructions (the vis_pdist()
instruction) accumulates the absolute values of differences into a destination accumulator
(p. 87, last paragraph). Sun states that "the pixels are subtracted from one another, pair
wise, and the absolute values of the differences are accumulated into accum" (p. 87,
4.7.11, Description, first paragraph). Sun does not disclose or suggest a version of the
vis_pdist() instruction that does not use an accumulation of prior absolute differences.
Nor does Sun disclose or suggest a plurality of partial product selectors to insert an
element of a plurality of elements of a packed data into and substituting for bit positions
of one or more partial products and add the plurality of elements together.

Lee is directed to a multiplier that is modified to perform tree additions (Abstract).
Lee aligns data from one input in partial products through use of a second input value.
Each bit of the second input value is set to zero except for a first subset of bits, starting
with the low order bit, which are set to one at intervals equal to a bit length of each
addend (col. 1, lines 47-55, col. 4, line 44 through col. 5, line 2). Lee then generates
control inputs to force to logic zero bit positions that do not correspond to the bit
positions of an element to be added (col. 5, lines 3-5). In order to sum four 4-bit
numbers, forty-eight (48) bit positions that do not correspond to the bit positions of the
elements to be added together are forced to logic zero (Table 6; col. 6,
lines 9-61). Lee does not disclose or suggest use of partial product selectors to insert the
elements of a packed data into bit positions of the partial products to add the elements
together.
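
By way of illustration only, the alignment and zero-forcing approach attributed to Lee
above may be sketched in Python as follows. The sketch assumes a simple shift-and-add
16x16 multiplier summing four 4-bit numbers; the function name lee_style_tree_add, the
packing order, and the row-by-row model are assumptions of this sketch and are not taken
from Lee's disclosure.

```python
ELEM_BITS = 4                     # bit length of each addend
NUM_ELEMS = 4                     # number of addends packed in the first input
WIDTH = ELEM_BITS * NUM_ELEMS     # 16-bit multiplier inputs (assumed)


def lee_style_tree_add(elems):
    """Return (sum, number of bit positions forced to zero)."""
    assert len(elems) == NUM_ELEMS
    assert all(0 <= e < (1 << ELEM_BITS) for e in elems)

    # First input: the addends packed side by side, elems[0] in the low field.
    a = 0
    for i, e in enumerate(elems):
        a |= e << (i * ELEM_BITS)

    # Second input: all zeros except ones at intervals of ELEM_BITS,
    # starting with the low order bit (0x1111 for this configuration).
    b = 0
    for i in range(NUM_ELEMS):
        b |= 1 << (i * ELEM_BITS)

    top_shift = WIDTH - ELEM_BITS  # column in which the addends line up
    forced_zero = 0
    total = 0
    for row in range(WIDTH):
        if not (b >> row) & 1:
            continue               # a zero multiplier bit gives an all-zero row
        # Only one field of this row lands in the aligned column; every
        # other bit position of the row is forced to logic zero.
        k = NUM_ELEMS - 1 - row // ELEM_BITS
        keep_mask = ((1 << ELEM_BITS) - 1) << (k * ELEM_BITS)
        forced_zero += WIDTH - ELEM_BITS
        total += (a & keep_mask) << row

    return total >> top_shift, forced_zero


print(lee_style_tree_add([1, 2, 3, 4]))
```

Under these assumptions, four partial product rows are non-zero, each keeps four bit
positions and has twelve forced to zero, which is consistent with the forty-eight (48)
bit positions noted above.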

Appellant respectfully points out some of the differences between the cited
references and the instant claims at issue. Claim 21, for example, sets forth:

21. (Previously Presented) The processor of Claim 16, wherein performing the first 
operation causes the execution unit to: 

produce a first plurality of partial products in a multiplier having a plurality 
of partial product selectors; 

insert an element of a first plurality of elements of a first packed data into 
and substituting for bit positions of one or more of the first plurality of partial 
products by using partial product selectors corresponding to the bit positions; 
and 

add the first plurality of elements together to produce a first result including 
a field comprising a sum of the first plurality of elements, said field having a 
least significant bit. 

In addition to the limitations presented above with regard to Claim 16, Claim 21 
sets forth that in performing the first operation responsive to decoding the packed sum of 
absolute differences instruction, the execution unit produces a plurality of partial products 
in a multiplier having a plurality of partial product selectors and inserts an element of a 
plurality of elements of a packed data into and substituting for bit positions of one or 
more of the first plurality of partial products by using partial product selectors 
corresponding to the bit positions and adding the plurality of elements together. 

Neither Sidwell nor Sun discloses a plurality of partial product selectors to insert
elements of a packed data into and substituting for bit positions of one or more partial 
products and adding the elements together. 

Lee's method, on the other hand, generates control inputs to force to logic zero bit 
positions that do not correspond to the bit positions of an element to be added, rather than 
inserting elements of a packed data into and substituting for bit positions to be added as 
set forth in Claim 21 (col. 5, lines 3-5). Further, Lee aligns data from one input in partial 
products through use of another input value rather than using the partial product selectors 
corresponding to the bit positions to be added as set forth in Claim 21 (col. 1, lines 47-55).

Each bit of the second input value is set to zero except for a first subset of bits,
starting with the low order bit, which are set to one at intervals equal to a bit length of
each addend (col. 1, lines 47-55).

The vis_pdist() instruction of Sun already has three source operands, one of which
is also the destination (p. 88, first paragraph). To perform the alignment in partial
products as suggested by Lee, a fourth source operand would be necessary. Sidwell's
system provides no third path for source inputs to packed arithmetic unit 6, much less a
fourth (Figs. 1, 2, 4, and 6; col. 5, line 15 through col. 7, line 53). Therefore Sidwell's
system could not perform the alignment in partial products as disclosed by Lee for Sun's
vis_pdist() instruction without substantial modification to the method or apparatus
disclosed, and such modification was not taught, suggested, or motivated by any of the
cited references without viewing the prior art in retrospect with the aid of appellant's
disclosure.

As cited above, the general rule applicable to a rejection based on a combination
of references, stated in Schaffer, 108 USPQ at 328-329, is that [it] is not enough for a valid
rejection to view the prior art in retrospect once an applicant's disclosure is known. The 
art applied should be viewed by itself to see if it fairly disclosed doing what an applicant 
has done. 

Without viewing the prior art in retrospect with the aid of appellant's disclosure, 
there is no suggestion in the cited references of producing a plurality of partial products in 
a multiplier having a plurality of partial product selectors and inserting an element of a 
plurality of elements of a packed data into and substituting for bit positions of one or 
more of the first plurality of partial products by using partial product selectors 
corresponding to the bit positions and adding the plurality of elements together.

Therefore, appellant respectfully submits that no suggestion is provided by
Sidwell, Sun or Lee for doing what appellant has done.

Because Lee generates control inputs to force to logic zero the bit positions that do
not correspond to the bit positions of an element to be added, forty-eight (48) bit positions
are forced to logic zero in order to sum four 4-bit numbers (Table 6; col. 6, lines 9-61).
Thus circuitry at forty-eight (48) bit positions must be modified rather than circuitry at the
sixteen (16) bit positions of the four 4-bit numbers being added. Such considerations are
of practical significance.

As shown in the present application, with regard to Fig. 12 (1220 and 1221), only
the sixty-four (64) bit positions being added need to be modified as opposed to the eighty
(80) bit positions not corresponding to bit positions being added, which is of practical
significance.

Appellant illustrates the method Lee taught for a 16x16 multiplier to sum 8-bit
values:

0000000000000000 0
0000000000000000 0
0000000000000000 0
0000000000000000 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 z z z z z z z z z 0 0 0 0 0 0 0 0

Using the partial product selectors corresponding to the bit positions being added
to insert elements of a packed data into those bit positions, as set forth in Claim 21, rather
than aligning data from one input in partial products with a second input value and
generating control inputs to force bit positions to zero that do not correspond to the bit
positions being added, as was taught by Lee, would result in nine adjacent partial
products (shown as shaded) being available for adding up to nine 8-bit values rather than
two. Thus the potential utilization of the multiplier circuitry may be increased by up to
350%, which is statistically significant.
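
By way of illustration only, the insertion approach and the 350% figure stated above may
be sketched in Python as follows. The sketch assumes nine partial product rows and two
8-bit addends under Lee's alignment, as stated in the preceding paragraph; the function
name sum_by_insertion, the column position, and the use of a plain integer sum in place
of the multiplier's adder circuitry are assumptions of this sketch and do not describe the
circuitry of the present application.

```python
ELEM_BITS = 8      # width of each value being summed
ROWS = 9           # partial product rows assumed available (from the text above)
LEE_ADDENDS = 2    # 8-bit addends assumed under Lee's alignment (from the text above)


def sum_by_insertion(elems, column_lsb=8):
    """Insert each element into the same bit positions of its own partial
    product row, then add the rows together (hypothetical model)."""
    assert len(elems) <= ROWS
    assert all(0 <= e < (1 << ELEM_BITS) for e in elems)
    rows = [0] * ROWS                  # unused rows contribute zero
    for i, e in enumerate(elems):
        rows[i] = e << column_lsb      # assumed selector behavior
    return sum(rows) >> column_lsb


# Relative increase in the number of addends handled at once.
utilization_increase = (ROWS - LEE_ADDENDS) / LEE_ADDENDS

print(sum_by_insertion([255] * 9))
print(utilization_increase)
```

Under these assumptions, going from two addends to nine is an increase of seven over a
base of two, that is, 3.5 times or 350%.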

Thus, producing a plurality of partial products in a multiplier having a plurality of 
partial product selectors and inserting an element of a plurality of elements of a packed 
data into and substituting for bit positions of one or more of the first plurality of partial 
products by using partial product selectors corresponding to the bit positions and adding 
the plurality of elements together in the combined system of Sidwell, Sun and Lee would
be nonobvious.

Accordingly, in light of the above arguments, Claims 21-22, 33-34 and 43-44 are
not obvious in view of the cited references.

2. Claims 23-24 Are Not Obvious.

In determining the scope and content of the cited references with regard to the 
instant claims at issue, appellant respectfully submits that Sidwell is directed to an 
arithmetic unit for packed arithmetic. The arithmetic unit 6 of Sidwell is comprised of a 
collection of separate packed arithmetic execution units, each responsible for some subset 
of packed arithmetic instructions (Fig. 2, col. 5, lines 15-19). Sidwell discloses that the
obvious packed arithmetic unit 80 performs the addition, subtraction, comparison and
multiplication of packed numbers (Figs. 4 and 5, col. 5, line 50 through col. 6, line 47).
Sidwell discloses that the multiply-add unit 76 is capable of executing a single
instruction, the result of executing that instruction being to multiply together respective
pairs of objects from two operands and to add together the results to provide a final result
(col. 7, lines 21-31, emphasis supplied). Sidwell does not disclose or suggest producing a
packed result having two distinct sums of products as a result of the multiply-add 
instruction. Nor does Sidwell disclose or suggest reuse of the multiply-add unit 76 even 
for performing packed multiply instructions (mul2ps). 

Sun is directed to a set of visual instructions used primarily to write graphics and 
multimedia applications (p. 41, first paragraph). One of these instructions (the vis_pdist() 
instruction) accumulates the absolute values of differences into a destination accumulator 
(p. 87, last paragraph). Sun states that "the pixels are subtracted from one another, pair 
wise, and the absolute values of the differences are accumulated into accum" (p. 87,
4.7.11, Description, first paragraph). Sun does not disclose or suggest a version of the
vis_pdist() instruction that does not use an accumulation of prior absolute differences.
Nor does Sun disclose or suggest a plurality of partial product selectors to insert an
element of a plurality of elements of a packed data into and substituting for bit positions
of one or more partial products and add the plurality of elements together.

Lee is directed to a multiplier that is modified to perform tree additions (Abstract).
Lee aligns data from one input in partial products through use of a second input value.
Each bit of the second input value is set to zero except for a first subset of bits, starting
with the low order bit, which are set to one at intervals equal to a bit length of each
addend (col. 1, lines 47-55, col. 4, line 44 through col. 5, line 2). Lee then generates
control inputs to force to logic zero bit positions that do not correspond to the bit
positions of an element to be added (col. 5, lines 3-5). In order to sum four 4-bit
numbers, forty-eight (48) bit positions that do not correspond to the bit positions of the
elements to be added together are forced to logic zero (Table 6; col. 6,
lines 9-61). Lee does not disclose or suggest use of partial product selectors to insert the
elements of a packed data into bit positions of the partial products to add the elements
together.

Appellant respectfully points out some of the differences between the cited 
references and the instant claims at issue. Claim 23, for example, sets forth: 

23. A processor comprising: 

a decode unit to decode a plurality of packed data instructions including a 
packed sum of absolute differences (PSAD) instruction having a first format to
identify a first set of packed data, and a packed multiply-add (PMAD)
instruction having a second format to identify a second set of packed data, said 
decode unit to initiate a first set of operations on the first set of packed data 
responsive to decoding the PSAD instruction and to initiate a second set of 
operations on the second set of packed data responsive to decoding the PMAD 
instruction; and 

an execution unit to perform a first operation of the first set of operations 
initiated by the decode unit and to perform a second operation of the second set 
of operations initiated by the decode unit; 

wherein performing the first operation causes the execution unit to: 
produce a first plurality of partial products in a multiplier having a
plurality of partial product selectors,

insert an element of a first plurality of elements of a first packed data
into and substituting for bit positions of one or more of the first plurality of
partial products by using partial product selectors corresponding to the bit
positions, and

add the first plurality of elements together to produce a first result 
including a field comprising a sum of the first plurality of elements, said field 
having a least significant bit; 

and wherein performing the second operation causes the execution unit to: 

produce a second plurality of partial products in the multiplier having 
the plurality of partial product selectors, the second plurality of partial products 
comprising four distinct sets of partial products including a first set of partial 
products corresponding to a first product for elements of the second set of 
packed data, a second set of partial products corresponding to a second product 
for elements of the second set of packed data, a third set of partial products 
corresponding to a third product for elements of the second set of packed data, 
and a fourth set of partial products corresponding to a fourth product for 
elements of the second set of packed data, and

add the first set of partial products together with the second set of 
partial products to produce a first distinct element of a packed result and add the 
third set of partial products together with the fourth set of partial products to 
produce a second distinct element of the packed result.

In addition to the limitations presented above with regard to Claim 16, Claim 23 
sets forth that in performing the first operation responsive to decoding the packed sum of 
absolute differences instruction, the execution unit produces a plurality of partial products 
in a multiplier having a plurality of partial product selectors and inserts an element of a 
packed data into and substituting for bit positions of one or more of the first plurality of 
partial products by using partial product selectors corresponding to the bit positions. 

As shown above with regard to Claim 21, neither Sidwell nor Sun discloses a
plurality of partial product selectors to insert elements of a packed data into and 
substituting for bit positions of one or more partial products and adding the elements 
together. 

Lee's method, on the other hand, generates control inputs to force to logic zero bit 
positions that do not correspond to the bit positions of an element to be added, rather than 
inserting elements of a packed data into and substituting for bit positions to be added as
set forth in Claim 21 (col. 5, lines 3-5). Further, Lee aligns data from one input in partial
products through use of another input value rather than using the partial product selectors
corresponding to the bit positions to be added as set forth in Claim 21 (col. 1, lines 47-55).

Appellant respectfully submits that no suggestion is provided by Sidwell, Sun or
Lee for doing what appellant has done, and has established that the differences have both 
practical and statistical significance. 

Thus, producing a plurality of partial products in a multiplier having a plurality of 
partial product selectors and inserting an element of a plurality of elements of a packed 
data into and substituting for bit positions of one or more of the first plurality of partial 
products by using partial product selectors corresponding to the bit positions and adding 
the plurality of elements together in the combined system of Sidwell, Sun and Lee would 
be nonobvious. 

Claim 23 also sets forth that in performing the second operation responsive to 
decoding the packed multiply-add instruction, the execution unit produces four distinct 
sets of partial products including a first set corresponding to a first product, a second set 
corresponding to a second product, a third set corresponding to a third product, and a
fourth set corresponding to a fourth product, and add the first set together with the second 
set of partial products to produce a first distinct element of a packed result and add the 
third set together with the fourth set of partial products to produce a second distinct 
element of the packed result. 

The multiply-add of Sidwell does not add a first and second set of partial products
and a third and fourth set of partial products to produce a packed result having two distinct
elements. Nor does Sidwell suggest another form of multiply-add instruction to produce
a packed result. Neither Sun nor Lee disclose or suggest a multiply-add instruction to add
a first and second products and a third and fourth products to produce a packed result.

Yet, the packed multiply-add disclosed by the present application to produce a 
packed result having two distinct elements may be useful in writing applications such as
those considered by Sun or Sidwell or Lee. For example, one application known as alpha 
blending of images, performs computation of a pixel color value as follows: 

α1*s1 + α2*s2

which would permit computation of two new pixel color values in parallel with one 
packed multiply-add instruction as set forth in Claim 23. 

On the contrary, Sun discusses alpha blending using the VIS™ instructions and
uses the fact that α2 is equal to (1 - α1) to algebraically transform (p. 116, lines 24-25,
minor corrections to Sun's algebra supplied):
(s1 - s2)*α1 + s2.

Thus, Sun teaches away from computing the sum of two products as set forth in the 
packed multiply-add instruction of Claim 23, and teaches instead to perform a subtraction 
(vis_fpsub16), a multiplication (vis_fmul8x16) and an addition (vis_fpadd16) to compute
four sums in parallel with three VIS instructions (e.g. see p. 118, lines 20-22 and 25-27).
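
By way of illustration only, the arithmetic discussed above can be checked with a short Python sketch. The function names (pmad_pair, blend_sum_of_products, blend_transformed) are hypothetical, exact fractions stand in for the fixed-point packed data of either instruction set, and the sketch models only the arithmetic stated in the text, not the actual PMAD or VIS instruction semantics.

```python
from fractions import Fraction


def pmad_pair(a, b):
    """Model of a packed multiply-add on two 4-element operands:
    four products, summed pairwise into a 2-element packed result."""
    assert len(a) == 4 and len(b) == 4
    products = [x * y for x, y in zip(a, b)]
    return [products[0] + products[1], products[2] + products[3]]


def blend_sum_of_products(alpha1, s1, s2):
    """Alpha blend written as a sum of two products."""
    alpha2 = 1 - alpha1
    return alpha1 * s1 + alpha2 * s2


def blend_transformed(alpha1, s1, s2):
    """The transformed form: subtract, multiply, add."""
    return (s1 - s2) * alpha1 + s2


alpha = Fraction(1, 4)
# Two pixel color values computed by one modeled packed multiply-add.
pixels = pmad_pair([alpha, 1 - alpha, alpha, 1 - alpha], [200, 100, 40, 80])
```

Both forms give the same values when α2 = 1 - α1; the difference between them lies in which operations (a sum of two products versus a subtract, multiply and add) are used to reach the result.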

As stated above, the general rule applicable to a rejection based on a combination 
of references stated in Schaffer, 108 USPQ at 328-329, is that it is not enough for a valid
rejection to view the prior art in retrospect once an applicant's disclosure is known. The
art applied should be viewed by itself to see if it fairly disclosed doing what an applicant 
has done. 

Since neither Sidwell nor Lee disclose or suggest a multiply-add instruction to add
a first and second products and a third and fourth products to produce a packed result having
two distinct elements, and since Sun not only fails to disclose or suggest but teaches away
from computing the sum of two products as set forth in the packed multiply-add
instruction of Claim 23, appellant respectfully submits that the cited references do not
fairly suggest, to one of skill in the art, a multiply-add instruction to add a first and
second products and a third and fourth products to produce a packed result having two
distinct elements.

Therefore, appellant respectfully submits that no suggestion is provided by 
Sidwell, Sun or Lee for doing what appellant has done. 

Because the multiply-add instruction adds the first and second products and the
third and fourth products to produce a packed result having two distinct elements, two
PMAD instructions could produce the four alpha blended sums that required three VIS
instructions, resulting in a 33% reduction of instructions, which is statistically significant.
Of practical significance is that the packed multiply-add to produce a packed result
having two distinct elements provides support for applications such as alpha blending that
require the sum of two products and not the sum of four products.

Thus, the presence of a multiply-add instruction to add a first and second products
and a third and fourth products to produce a packed result having two distinct elements in
the combined system of Sidwell, Sun and Lee would also be nonobvious. 

Accordingly, in light of the above arguments, Claims 23-24 are not obvious in
view of the cited references.

Conclusion



Appellant submits that all claims now pending are in condition for allowance. 
Such action is earnestly solicited at the earliest possible date. If there is a deficiency in 
fees, please charge our Deposit Acct. No. 02-2666.



Respectfully submitted, 



Date: 8-10-05




12400 Wilshire Boulevard 
Seventh Floor 

Los Angeles, CA 90025-1026
(408)720-8598 






VIII. Claims Appendix: Claims Allowed and Involved in Appeal (Clean Copy)
1-15. (Cancelled) 

16. (Previously Presented) A processor comprising: 

a decode unit to decode a plurality of packed data instructions including a packed
sum of absolute differences (PSAD) instruction having a first format to identify a first set
of packed data, and a packed multiply-add (PMAD) instruction having a second format to
identify a second set of packed data, said decode unit to initiate a first set of operations on
the first set of packed data responsive to decoding the PSAD instruction and to initiate a
second set of operations on the second set of packed data responsive to decoding the
PMAD instruction; and

an execution unit to perform a first operation of the first set of operations initiated
by the decode unit and to perform a second operation of the second set of operations
initiated by the decode unit.

17. (Previously Presented) The processor of Claim 16, wherein the decode unit further
decodes a plurality of instructions of a PENTIUM microprocessor instruction set.

18. (Previously Presented) The processor of Claim 16, wherein the first set of
operations comprises: 

a packed subtract and write carry (PSBWC) operation; 

a packed absolute value and read carry (PABSRC) operation; and 

a packed add horizontal (PADDH) operation. 

19-20. (Cancelled)

21. (Previously Presented) The processor of Claim 16, wherein performing the first
operation causes the execution unit to: 

produce a first plurality of partial products in a multiplier having a plurality of 
partial product selectors; 

insert an element of a first plurality of elements of a first packed data into and 
substituting for bit positions of one or more of the first plurality of partial products by 
using partial product selectors corresponding to the bit positions; and 

add the first plurality of elements together to produce a first result including a 
field comprising a sum of the first plurality of elements, said field having a least 
significant bit.

22. (Previously Presented) The processor of Claim 21, wherein performing the first 
operation further causes the execution unit to: 

shift the first result to produce a second result having a least significant bit 
position and to align the least significant bit of the field with the least significant bit 
position of the second result.

23. (Previously Presented) A processor comprising:

a decode unit to decode a plurality of packed data instructions including a packed 
sum of absolute differences (PSAD) instruction having a first format to identify a first set 
of packed data, and a packed multiply-add (PMAD) instruction having a second format to 
identify a second set of packed data, said decode unit to initiate a first set of operations on
the first set of packed data responsive to decoding the PSAD instruction and to initiate a
second set of operations on the second set of packed data responsive to decoding the 
PMAD instruction; and 

an execution unit to perform a first operation of the first set of operations initiated 
by the decode unit and to perform a second operation of the second set of operations 
initiated by the decode unit; 

wherein performing the first operation causes the execution unit to: 

produce a first plurality of partial products in a multiplier having a plurality of
partial product selectors,

insert an element of a first plurality of elements of a first packed data into and 
substituting for bit positions of one or more of the first plurality of partial products by 
using partial product selectors corresponding to the bit positions, and 

add the first plurality of elements together to produce a first result including a 
field comprising a sum of the first plurality of elements, said field having a least 
significant bit; 

and wherein performing the second operation causes the execution unit to: 
produce a second plurality of partial products in the multiplier having the 
plurality of partial product selectors, the second plurality of partial products comprising 
four distinct sets of partial products including a first set of partial products corresponding 
to a first product for elements of the second set of packed data, a second set of partial 
products corresponding to a second product for elements of the second set of packed data,
a third set of partial products corresponding to a third product for elements of the second 
set of packed data, and a fourth set of partial products corresponding to a fourth product 
for elements of the second set of packed data; and 

add the first set of partial products together with the second set of partial 
products to produce a first distinct element of a packed result and add the third set of 
partial products together with the fourth set of partial products to produce a second 
distinct element of the packed result. 

24. (Previously Presented) The processor of Claim 23, wherein the second format 
identifies the second set of packed data as packed words. 

25. (Cancelled)

26. (Previously Presented) A processor to execute instructions of the PENTIUM 
microprocessor instruction set, the processor comprising: 

decode logic to decode a packed sum of absolute differences (PSAD) instruction 
having a first format to identify a first set of packed data, said decode logic to initiate a

first set of operations on the first set of packed data responsive to decoding the PSAD
instruction; 

execution logic to perform a first operation of the first set of operations initiated 
by the decode logic; and 

a bus to provide the first set of packed data to the execution logic for performing 
of the first operation. 

27. (Previously Presented) The processor of Claim 26, wherein the decode logic 
comprises a look-up table. 

28. (Previously Presented) The processor of Claim 26, wherein the decode logic 
comprises integrated circuitry. 

29. (Previously Presented) The processor of Claim 28, wherein the decode logic 
further comprises executable operations. 

30. (Previously Presented) The processor of Claim 29, wherein the decode logic
comprises: 

a packed subtract and write carry (PSBWC) operation; 

a packed absolute value and read carry (PABSRC) operation; and

a packed add horizontal (PADDH) operation. 

31. (Previously Presented) The processor of Claim 26, wherein the first format 
identifies the first set of packed data as packed bytes. 

32. (Cancelled) 

33. (Previously Presented) The processor of Claim 26, wherein performing the first 
operation causes the execution logic to: 

produce a first plurality of partial products in a multiplier having a plurality of
partial product selectors;

insert an element of a first plurality of elements of a first packed data into and 
substituting for bit positions of one or more of the first plurality of partial products by 
using partial product selectors corresponding to the bit positions; and 

add the first plurality of elements together to produce a first result including a 
field comprising a sum of the first plurality of elements, said field having a least 
significant bit. 

34. (Previously Presented) The processor of Claim 33, wherein performing the first 
operation further causes the execution logic to: 

shift the first result to produce a second result having a least significant bit 
position and to align the least significant bit of the field with the least significant bit 
position of the second result.

35. (Previously Presented) The processor of Claim 26, the decode unit to decode a 
packed multiply-add (PMAD) instruction having a second format to identify a second set 
of packed data, said decode unit to initiate a second set of operations on the second set of 
packed data responsive to decoding the PMAD instruction. 

36. (Previously Presented) The processor of Claim 35, execution unit to perform a 
second operation of the second set of operations initiated by the decode unit.

37. (Previously Presented) The processor of Claim 35, wherein the second format 
identifies the second set of packed data as packed words.

38. (Cancelled) 

39. (Previously Presented) A processor comprising: 

decode logic to decode a packed sum of absolute differences (PSAD) instruction
having a first format to identify a first set of packed data, said decode logic to initiate a
first set of operations on the first set of packed data responsive to decoding the PSAD
instruction, the first set of operations comprising:

a packed subtract and write carry (PSUBWC) operation;

a packed absolute value and read carry (PABSRC) operation; and 

a packed add horizontal (PADDH) operation; and
execution logic to perform the first set of operations initiated by the decode logic. 

40. (Previously Presented) The processor of Claim 39, wherein the first format 
identifies the first set of packed data as packed bytes. 

41. (Previously Presented) The processor of Claim 39, wherein performing the
PSUBWC operation causes the execution logic to: 

subtract one of a plurality of elements of a first packed data of the first set of 
packed data from a corresponding one of a plurality of elements of a second packed data 
of the first set of packed data to produce a first result having a plurality of difference 
elements and a plurality of sign indicators; and

store the plurality of difference elements and the plurality of sign indicators. 

42. (Previously Presented) The processor of Claim 39, wherein performing the 
PABSRC operation causes the execution logic to: 

receive a plurality of difference elements and a plurality of sign indicators; 
produce a result data having a plurality of absolute value elements, each absolute 
value element produced by 

(a) subtracting one of the plurality of difference elements from a corresponding 
constant value if the sign indicator corresponding to that element is in a first state, or 

(b) adding one of the plurality of difference elements to a corresponding constant 
value if the sign indicator corresponding to that element is in a second state. 

43. (Previously Presented) The processor of Claim 39, wherein performing the 
PADDH operation causes the execution logic to:

produce a first plurality of partial products in a multiplier having a plurality of
partial product selectors; 

insert an element of a first plurality of elements of a first packed data into and 
substituting for bit positions of one or more of the first plurality of partial products by 
using partial product selectors corresponding to the bit positions; and 

add the first plurality of elements together to produce a first result including a 
field comprising a sum of the first plurality of elements, said field having a least 
significant bit. 

44. (Previously Presented) The processor of Claim 43, wherein performing the 
PADDH operation further causes the execution logic to: 

shift the first result to produce a second result having a least significant bit 
position and to align the least significant bit of the field with the least significant bit 
position of the second result.
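
For orientation only, and without limiting or interpreting any claim, the three operations recited in Claims 39 and 41-43 can be modeled arithmetically in Python. This is a sketch under stated assumptions, none of which is taken from the record: elements are treated as unsigned 8-bit values (packed bytes, per Claim 40), the sign indicator is modeled as a borrow, the "constant value" is taken to be zero, and the function names are hypothetical. The multiplier-based partial product insertion of Claim 43 is not modeled.

```python
MASK = 0xFF  # assumption: 8-bit elements (packed bytes)


def psubwc(first, second):
    """Subtract each element of `first` from the corresponding element of
    `second`; return wrapped differences and per-element sign indicators."""
    diffs = [(y - x) & MASK for x, y in zip(first, second)]
    signs = [y < x for x, y in zip(first, second)]  # True models a borrow
    return diffs, signs


def pabsrc(diffs, signs, constant=0):
    """Subtract the difference from the constant where the sign indicator is
    set, otherwise add it to the constant (constant assumed to be zero)."""
    return [((constant - d) if s else (constant + d)) & MASK
            for d, s in zip(diffs, signs)]


def paddh(elements):
    """Horizontal add: sum all elements into a single field."""
    return sum(elements)


def psad(first, second):
    """Sum of absolute differences composed from the three operations."""
    diffs, signs = psubwc(first, second)
    return paddh(pabsrc(diffs, signs))


total = psad([10, 200, 7, 0], [20, 100, 7, 255])
```

Under these assumptions the composed result equals the sum of |x - y| over corresponding elements.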
IX. Evidence Appendix: Copies of Evidence Relied Upon by Appellant



Exhibit A 



i. The "Pentium® Processor Family Developer's Manual, Vol. 3: Architecture and Programming
Manual," 1995, pp. 25-165 and 25-166, cited by the Examiner in the Office Action (8.5) as extrinsic
evidence for a common knowledge of the PENTIUM microprocessor instruction set by one of
ordinary skill in the art.

The above cited reference was entered in the record by the examiner with the Office Action mailed on 
January 10, 2005. 

ii. A November 1997 article by Eric Traut from BYTE, which discusses a Macintosh application that
employs a "Pentium instruction-set emulator, complete with MMX™ instructions."

iii. A definition of AMD from wordIQ.com, which explains (in the History section, paragraph 7) that
at some time about one year after AMD purchased NexGen in 1996, "the K6 [processor] translated
the Pentium compatible x86 instruction set to RISC-like micro-instructions."

iv. John Savill's FAQ (Frequently Asked Questions) for Windows web page, dated September 3,
1999, which asks, "Do I really need 166Mhz Pentium processors to run SQL Server 7.0?" The
answer given states, "No. But you DO need a 100% PENTIUM compatible chip - which rules out
some Cyrix and IBM processors." The page further explains (in paragraph 3) that, "speed of the
processor doesn't matter as long as it runs the full pentium instruction set."

v. A current product description of a single-board computer from SBS Technologies, which includes a
"Pentium compatible Geode GX1 processor."

vi. A Department of Energy (hq.doe.gov) description of the Hardware & System Requirements for
Microsoft Windows 2000 and Microsoft Office 2000 by IT Standards Manager, Carol Blackston,
requiring a "133 MHz or higher Pentium-compatible CPU" for Windows 2000 Professional, a "166
MHz Pentium-compatible CPU or higher" for Office 2000 Premium, and a "75 MHz Pentium-compatible
CPU or higher" for Office 2000 Professional or Office 2000 Standard.

vii. Microsoft requirements for a Microsoft Operations Manager Server, a Database Server, a
Reporting Server, or a SQL Server 2000 Reporting Services Server, listed as a "PC with 550 MHz or
higher Pentium-compatible;" an Administrator and Operator Console, listed as a "PC with 500 MHz
or higher Pentium-compatible;" and a Managed Computer, listed as a "PC with 200 MHz or higher
Pentium-compatible."

viii. An article by Taran Rampersad from the Free Software Consortium (FSC) dated March 26, 2004,
describing the basic system requirements of OpenOffice under Windows (98, NT, 2000, XP)
including a "Pentium-compatible PC."

ix. An article by Thomas Latuske posted June 8, 2004, describing two ways to retrieve the processor
speed and stating (in paragraph 1) that, "If you want to use the function to calculate the speed
(frequency), you have to use it with a Pentium instruction set compatible processor."

The above cited references were entered in the record by the examiner with the Declaration submitted under
37 CFR §1.132 on October 11, 2004.

x. US Patent 5,859,789 (Sidwell)

The above cited reference was entered in the record by the examiner with the Office Action mailed on 
August 19, 2003. 

xi. Visual Instruction Set (VIS™) User's Guide, Sun Microsystems, March 1997 (Sun)
xii. US Patent 5,721,697 (Lee).

The above cited references were entered in the record by the examiner with the Information Disclosure filed 
by appellant on November 12, 2001. 

xiii. US Patent 5,938,756 (Van Hook).

The above cited reference was entered in the record by the examiner with the Information Disclosure filed 
by appellant on November 26, 2001. 






Pentium® Processor Family 
Developer's Manual 

Volume 3: 

Architecture and Programming Manual 



NOTE: The Pentium® Processor Family Developer's Manual consists 
of three books: Pentium® Processor, Order Number 241428; the
82496/82497/82498 Cache Controller and 82491/82492/82493 Cache 
SRAM, Order Number 241429; and the Architecture and 
Programming Manual, Order Number 241430. 
Please refer to all three volumes when evaluating your design needs. 



1995

Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual
property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability
whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to
fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not
intended for use in medical, life saving, or life sustaining applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

The Pentium® processor may contain design defects or errors known as errata. Current characterized errata are available on request.
*Third-party brands and names are the property of their respective owners.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an ordering number and are referenced in this document, or other Intel literature, may be obtained from:

Intel Corporation
P.O. Box 7641
Mt. Prospect, IL 60056-7641

or call 1-800-879-4683

COPYRIGHT C \Ktf=L ttftPOftATON IflW 
INSTRUCTION SET

IMUL—Signed Multiply

Opcode      Instruction             Clocks   Description
F6 /5       IMUL r/m8               11       AX ← AL * r/m byte
F7 /5       IMUL r/m16              11       DX:AX ← AX * r/m word
F7 /5       IMUL r/m32              10       EDX:EAX ← EAX * r/m dword
0F AF /r    IMUL r16,r/m16          10       word register ← word register * r/m word
0F AF /r    IMUL r32,r/m32          10       dword register ← dword register * r/m dword
6B /r ib    IMUL r16,r/m16,imm8     10       word register ← r/m16 * sign-extended immediate byte
6B /r ib    IMUL r32,r/m32,imm8     10       dword register ← r/m32 * sign-extended immediate byte
6B /r ib    IMUL r16,imm8           10       word register ← word register * sign-extended immediate byte
6B /r ib    IMUL r32,imm8           10       dword register ← dword register * sign-extended immediate byte
69 /r iw    IMUL r16,r/m16,imm16    10       word register ← r/m16 * immediate word
69 /r id    IMUL r32,r/m32,imm32    10       dword register ← r/m32 * immediate dword
69 /r iw    IMUL r16,imm16          10       word register ← r/m16 * immediate word
69 /r id    IMUL r32,imm32          10       dword register ← r/m32 * immediate dword


Operation 

result <- multiplicand * multiplier; 



Description 

The IMUL instruction performs signed multiplication. Some forms of the instruction use 
implicit register operands. The operand combinations for all forms of the instruction are 
shown in the "Description" column above. 

The IMUL instruction clears the OF and CF flags under the following conditions (otherwise 
the CF and OF flags are set): 



Instruction Form       Condition for Clearing CF and OF
r/m8                   AL = sign-extend of AL to 16 bits
r/m16                  AX = sign-extend of AX to 32 bits
r/m32                  EDX:EAX = sign-extend of EAX to 32 bits
r16,r/m16              Result exactly fits within r16
r32,r/m32              Result exactly fits within r32
r16,r/m16,imm16        Result exactly fits within r16
r32,r/m32,imm32        Result exactly fits within r32



Flags Affected 

The OF and CF flags as described in the table in the "Description" section above; the SF, ZF,
AF, and PF flags are undefined.
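
As an illustration only, the "result exactly fits" condition for the two-operand register forms can be modeled in Python. The helper names (to_signed, imul_two_operand) are hypothetical, and the sketch models only the truncated result and the shared CF/OF value, not the undefined flags or instruction timing.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def imul_two_operand(dest, src, bits=32):
    """Model IMUL r,r/m: return (truncated signed result, CF/OF set?)."""
    full = to_signed(dest, bits) * to_signed(src, bits)
    truncated = to_signed(full, bits)
    # CF and OF are cleared only when the full product fits in the register.
    return truncated, truncated != full
```

For example, imul_two_operand(3, 4) leaves the flags clear, while imul_two_operand(0x7FFFFFFF, 2) sets them because the full product does not fit in 32 bits.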

Protected Mode Exceptions 

#GP(0) for an illegal memory operand effective address in the CS, DS, ES, FS, or GS 
segments; #SS(0) for an illegal address in the SS segment; #PF(fault-code) for a page fault;
#AC for unaligned memory reference if the current privilege level is 3. 

Real Address Mode Exceptions 

Interrupt 13 if any part of the operand would lie outside of the effective address space from 0 
to 0FFFFH. 

Virtual 8086 Mode Exceptions 

Same exceptions as in Real Address Mode; #PF(fault-code) for a page fault; #AC for
unaligned memory reference if the current privilege level is 3.

Notes 

When using the accumulator forms (IMUL r/m8, IMUL r/m16, or IMUL r/m32), the result of
the multiplication is available even if the overflow flag is set because the result is twice the
size of the multiplicand and multiplier. This is large enough to handle any possible result.

Building the Virtual PC

November 1997 / Core Technologies / Building the Virtual PC

A software emulator shows that the PowerPC 
can emulate another computer, down to its 
very hardware. 

Eric Traut

Development of Virtual PC — Connectix Corporation's Macintosh 
application that emulates a PC and its peripherals — began almost two 
years ago, in October 1995. The goal from the beginning was to 
create a fully Intel-compatible PC in software. The effort centered 
around a core Pentium instruction-set emulator, complete with MMX 
instructions. True PC emulation also required the reverse-engineering 
and development of a dozen other PC motherboard devices, including 
modern peripherals such as an accelerated SVGA card, an Ethernet 
controller, a Sound Blaster Pro sound card, IDE/ATAPI controller, and 
PCI bridge interface. This strategy of hardware-level emulation 
resulted in an application that allows Macintosh users to run not only 
Windows programs and DOS games but several x86-based OSes, 
including Windows 95, NT, and NeXT OpenStep.

Pentium Emulation 

The heart of Virtual PC is the Pentium recompiling emulator, a 
sophisticated piece of software written entirely in hand-coded PowerPC
assembly language. Its job is to translate Pentium instruction
sequences into a set of optimized PowerPC instructions that perform
the same operation. Translation occurs on a "basic-block" basis, where
a basic block consists of a sequence of decoded x86 instructions. Basic
blocks end on an instruction that abruptly changes the flow of
execution (typically a jump, call, or return-from-subroutine
instruction). As the recompiler decodes x86 instructions, it analyzes
them for "condition code" usage. Finally, it generates a block of
PowerPC code that accomplishes the same task. For more details on
this process, see "Virtual PC Operation."

For purposes of speeding things up, the emulator employs the 
following tricks. 

Translation cache: Even though written in PowerPC assembly 
language, the translator still requires substantial time to generate 
optimized instruction translations. To reduce this overhead, the 
emulator caches blocks of translated code. 
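
The caching idea can be sketched in a few lines of Python. This is a minimal illustration, not Connectix's implementation: the class name, the dictionary keyed by guest block address, and the stand-in translator are all assumptions made for the example.

```python
class TranslationCache:
    """Cache of translated blocks keyed by guest (x86) block address."""

    def __init__(self, translate_block):
        # translate_block is a hypothetical callable: guest address -> host code
        self._translate_block = translate_block
        self._blocks = {}
        self.misses = 0

    def lookup(self, guest_address):
        """Return translated code for the block, translating only on a miss."""
        if guest_address not in self._blocks:
            self.misses += 1
            self._blocks[guest_address] = self._translate_block(guest_address)
        return self._blocks[guest_address]


def fake_translate(guest_address):
    # Stand-in for the expensive recompiler; returns a label instead of code.
    return f"host_code_for_{guest_address:#x}"


cache = TranslationCache(fake_translate)
first = cache.lookup(0x1000)   # miss: the block is translated
second = cache.lookup(0x1000)  # hit: the cached translation is reused
```

The second lookup of the same block address does not invoke the translator again, which is the overhead the cache is meant to avoid.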

Interinstruction optimization: Because the Pentium is a CISC 
processor, most instructions perform more than one operation. For 
example, the ADD instruction not only adds two values together, it 
also produces a number of condition-code flags that tell programs
whether the addition produced a zero or negative result. Such codes
are used, for example, to determine if a program performs a
conditional jump. Most of the time these codes are ignored. The
translator analyzes blocks of x86 instructions to determine which
flags the program uses (if any). It then generates PowerPC code for
those flags actually used. The first two listings in "Translated Code"
show how one Pentium instruction translates into three PowerPC 
instructions, while three Pentium instructions can be optimized from 
nine into five PowerPC instructions. 
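
One way to picture this analysis is a backward pass over a basic block that marks which instructions must actually materialize their flags. The Python sketch below is an assumption-laden illustration, not the emulator's algorithm: each instruction is reduced to a hypothetical (name, writes_flags, reads_flags) tuple, flags are conservatively assumed to be needed after the block, and the cost model (three host instructions when flags are kept, one when they are not) is inferred from the article's figures.

```python
def flags_needed(block, live_out=True):
    """For each instruction, decide whether its flag results are ever used.

    block: list of (name, writes_flags, reads_flags) tuples.
    live_out: assume flags may be read after the block (conservative).
    """
    needed = [False] * len(block)
    live = live_out
    for i in range(len(block) - 1, -1, -1):
        _name, writes, reads = block[i]
        if writes:
            needed[i] = live   # flags matter only if someone reads them later
            live = False       # this write overwrites any earlier flag values
        if reads:
            live = True        # an instruction's read precedes its own write
    return needed


def host_instruction_count(block):
    """Assumed cost: 3 host instructions with flags, 1 without."""
    return sum(3 if keep else 1 for keep in flags_needed(block))


adds = [("ADD EAX,20", True, False),
        ("ADD EBX,30", True, False),
        ("ADD ECX,40", True, False)]
```

With the three ADD instructions above, only the last one has to produce flags, so the assumed cost drops from nine host instructions to five.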

Address translation: One of the most difficult Pentium features to 
emulate is its built-in memory management unit (MMU). This 
hardware translates linear (or logical) addresses into physical memory 
addresses. Operating systems use the MMU to implement virtual 
memory and memory protection. Because of the Pentium's small 
register file, about three in four Pentium instructions reference 
memory in one way or another. Each memory address potentially 
needs to be translated before the emulator loads from, or stores to,
the referenced address. An MMU implemented in software would
impose a high overhead, which would degrade performance. Luckily,
this overhead can be avoided: The Connectix engineers were able to
program the PowerPC's MMU to mimic the Pentium MMU's behavior,
thus managing the address translations in hardware. The Pentium's
memory page attributes can also be mirrored in the PowerPC's MMU.
For example, if Virtual PC's emulated OS marks a memory page as
write-protected, the page mappings are modified so the corresponding 
PowerPC page is write-protected. 

Segment bounds checking: The Pentium architecture includes the 
archaic notion of memory segments. Every memory reference, such as 
instruction fetches, stack operations, loads, and stores, has an 
associated memory segment. When a segment's bounds are
exceeded, the Pentium's MMU generates a general protection fault
(GPF). The OS uses GPFs for more than detecting bugs in applications:
They enable a program to "thunk" down into privileged driver-level
code not accessible at the application level. Therefore, the Pentium
emulator must detect segment bound faults where appropriate.
Although the PowerPC does not contain segmentation hardware akin
to the Pentium, Connectix used PowerPC trap instructions to perform
segment bounds checks with little or no overhead.

Hardware Emulation 

Besides the Pentium processor, a typical PC motherboard contains a 
dozen or so chips that work together concurrently. All these chips 
need to be emulated faithfully for compatibility. The Intel architecture 
provides an I/O address space that's used to access hardware outside 
of the CPU. You work with this "I/O space" through two instructions —
IN and OUT. When using these instructions, software must specify an
I/O port (or address). Virtual PC routes I/O accesses to code modules
that emulate each chip. For example, if Virtual PC encounters an IN
instruction referencing port 0x21, it calls a routine in the
interrupt-controller emulation module that returns the current interrupt mask.
Similar module calls occur for every I/O space access, as the third
listing in "Translated Code" shows.
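
The routing described above can be pictured as a table from port numbers to device modules. The Python sketch below is illustrative only: the class and method names are hypothetical, the reset value of the mask and the value returned for unmapped ports are assumptions, and only port 0x21 (the interrupt mask mentioned in the text) is modeled.

```python
class InterruptControllerModel:
    """Toy model of the interrupt controller's mask register."""

    def __init__(self):
        self.mask = 0xFF  # assumption: all interrupts masked at reset

    def read(self, port):
        return self.mask

    def write(self, port, value):
        self.mask = value & 0xFF


class IOSpace:
    """Routes IN/OUT accesses to the module that emulates each port."""

    UNMAPPED = 0xFF  # assumption: value returned for ports with no device

    def __init__(self):
        self._devices = {}

    def attach(self, port, device):
        self._devices[port] = device

    def port_in(self, port):
        device = self._devices.get(port)
        return device.read(port) if device is not None else self.UNMAPPED

    def port_out(self, port, value):
        device = self._devices.get(port)
        if device is not None:
            device.write(port, value)


io_space = IOSpace()
io_space.attach(0x21, InterruptControllerModel())
io_space.port_out(0x21, 0xFB)            # emulated OUT to the mask register
current_mask = io_space.port_in(0x21)    # emulated IN returns the mask
```

A table lookup keeps the emulated IN and OUT paths short, and each chip's behavior stays inside its own module.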

Many of the extra chips on a PC motherboard control I/O devices such 
as the hard drive, CD-ROM, keyboard, and mouse. For compatibility 
with the Mac OS and all Macintosh hardware, Virtual PC performs all 
I/O through the standard Mac OS drivers. So, a request sent to the 
emulated PC's IDE controller to read a sector from the hard drive gets 
translated into a read operation that's sent to the Mac OS SCSI driver. 

The most difficult hardware components to emulate involve precise 
timing. For example, sound is a real-time operation, and any timing 
perturbation results in clicks or pops as digitally sampled data fails to 
arrive on time. Because Virtual PC is hosted on the Mac OS (which 
gives time to other Mac programs running concurrently, as well as 
Virtual PC), and it needs to emulate several dozen PC chips in parallel, 
precise timing isn't always possible. Virtual PC compensates by placing 
the highest priority on tasks that directly affect the user, such as 
sound and video. 

Performance

Emulated systems are naturally going to be slower than real 
hardware. But Connectix engineers concentrated on tuning aspects of 
the emulated hardware required to run popular PC games and 
productivity applications at a usable performance level. This was 
especially challenging given that the PowerPC processor emulates not 
only the Pentium but all the other chips on a PC motherboard. 

Performance of Virtual PC is also greatly affected by the host 
hardware system. The latest PowerPC processors with high clock rates
and large on-chip caches will run it best. The speed and size of the
system's L2 cache is also critical because of the code expansion that 
occurs during the translation process. 

While users will take a performance hit because this is an emulator, 
Virtual PC successfully emulates the entire PC at a very low level. PC 
programs (applications, device drivers, and operating systems alike)
cannot tell they are not running on actual PC hardware.



Translated Code

Translation of Single Pentium Instruction

Pentium instruction    PowerPC instructions
ADD EAX,20             li      rTemp1,20
                       addco.  rPF,rTemp1,rEAX
                       mr      rEAX,rPF

Translation of Pentium Instruction Block

Pentium instructions   PowerPC instructions
ADD EAX,20             add     rEAX,rEAX,20
ADD EBX,30             add     rEBX,rEBX,30
ADD ECX,40             li      rTemp1,40
                       addco.  rPF,rTemp1,rECX
                       mr      rECX,rPF

Code Translation for Pentium I/O Instructions

Pentium instructions   PowerPC instructions
MOV AL,8               li      rAL,8
MOV DX,0x1F0           li      rDX,0x1F0
OUT DX,AL              bl      HandleIDEPortWrite
ADD DX,7               addi    rDX,rDX,7
IN AL,DX               bl      HandleIDEPortRead
RET                    addi    rIP,rIP,8
                       b       DispatchToNextBlock

Virtual PC Operation (illustration link, 24 Kbytes)

Eric Traut (traut@connectix.com) is lead engineer for Virtual PC at
Connectix. At Apple Computer, he wrote the 680x0 dynamic recompiling
emulator for PowerPC-based Macs.

Copyright © 2005 CMP Media LLC


Do I really need 166MHz Pentium processors to run SQL Server 7.0?

Neil Pike

InstantDoc #14153
September 3, 1999




A. No. But you DO need a 100% Pentium-compatible chip, which rules out some Cyrix and IBM
processors. The only way around this is for the chip vendor to offer a microcode upgrade.
(Some non-Intel chips say they are Pentiums, but in fact only implement the 486 chip-set.)

The following quote is from Cyrix:
"Recently an issue with SQL Server 7.0 has been discovered with the non-MMX Media GX and the
6x86 processors. A fix for this issue can be obtained from Cyrix technical support at:
tech_support@cyrix.com"



The actual speed of the processor doesn't matter as long as it runs the full Pentium
instruction set: it needs to support the CMPXCHG8B (Compare and Exchange 8 Bytes)
and RDTSC (Read Time-Stamp Counter) instructions. Microsoft has made this a requirement
because it is the minimum spec machine that they have developed/tested with, which is OK if
you get most of your equipment donated/loaned/replaced by hardware companies free of
charge, but this isn't the case with most businesses!
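
A processor reports support for these two instructions through the CPUID instruction: in the feature flags returned in EDX for leaf 1, bit 4 indicates the time-stamp counter (RDTSC) and bit 8 indicates CMPXCHG8B. The Python sketch below only decodes an EDX value that has already been obtained by some other means; the function name is hypothetical, and this is not a description of how SQL Server itself performs the check.

```python
CPUID_EDX_TSC = 1 << 4  # time-stamp counter: RDTSC available
CPUID_EDX_CX8 = 1 << 8  # CMPXCHG8B available


def missing_pentium_features(edx: int) -> list:
    """Return the names of the required instructions the CPU lacks.

    `edx` is the EDX register value returned by CPUID leaf 1.
    """
    required = {"RDTSC": CPUID_EDX_TSC, "CMPXCHG8B": CPUID_EDX_CX8}
    return [name for name, bit in required.items() if not edx & bit]
```

An empty list means both instructions are reported as present; a chip that identifies itself as a Pentium but returns a non-empty list would fall into the "not 100% compatible" category described above.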

As long as the server previously ran SQL 6.5 (and is 100% Pentium compatible) you should
find that it will run SQL 7.0 and will offer significant performance improvements, so don't
upgrade hardware for the sake of it.

The following quote is from Microsoft Product Support Services:
"When using SQL Server v7.0, Microsoft recommends a processor speed of 166MHz or higher
for server machines. Our extensive testing of the product has been done on machines of this
calibre and we believe customers will get a better price performance with the product when used
in this configuration. Microsoft will support SQL Server v7.0 when run on server machines with
slower processors. However customers should recognise that if our findings are that major
problems can be eliminated by using faster processors we will continue to recommend, and in
some cases may require, compliance with this suggestion."

The reason for this caveat is that some of the decisions the optimiser makes on a 166MHz
Pentium may not make so much sense on a 60MHz Pentium, i.e. the extra CPU time a 60MHz
part needs may mean that a non-optimal plan has been chosen.



Windows IT Pro is a Division of Penton Media Inc.

Energy CIO/Information Technology Standards Program



Hardware & System Requirements
for
Microsoft Windows 2000 and Microsoft Office 2000

Microsoft Hardware Requirements (in Word)

Operating System (OS): Windows 2000 Professional

Hardware/System Requirements:

CPU:
  133 MHz or higher Pentium-compatible CPU
  300 MHz recommended
  Supports single and dual CPU systems
  NOTE: Check driver availability for peripheral devices.

RAM:
  64 MB (minimum) - 4 GB RAM (maximum)
  128 MB recommended

Disk Space:
  2 GB hard disk
  650 MB free if installed on standalone PC
  700 MB if installed over server
  1 GB recommended

For more detailed information, visit the Microsoft web site:
www.microsoft.com/windows2000/upgrade/upgradereqs/default.asp


Applications: Office 2000 Premium
(includes: Word, Excel, PowerPoint, Outlook, Access, FrontPage,
Publisher, Small Business Tools, PhotoDraw)

Hardware/System Requirements:
For installing a typical configuration to the local PC

CPU:
  166 MHz Pentium-compatible CPU or higher for
  MS Windows 95 or later OS or
  MS NT Workstation OS ver. 4.0; Service Pack 3 or later
  300 MHz Pentium-compatible CPU or higher recommended

RAM:
  16 MB RAM for Windows 95/98 OS and 32 MB for Windows NT Workstations
  Additional 4 MB RAM for each application running simultaneously, except:
    8 MB required for Outlook, Access, FrontPage
    16 MB required for PhotoDraw
  Recommended total 256 MB to accommodate Office suite, other
  applications, toolbar, etc.

Disk Space:
  252 MB for Word, Excel, PowerPoint, Outlook, Access, FrontPage
  174 MB Publisher, Small Business Tools
  100 MB PhotoDraw
  CD-ROM required for non-network installations
  Network card required for network installations


Applications: Office 2000 Professional
(includes: Word, Excel, Outlook, PowerPoint, Access, Publisher,
Small Business Tools)

Hardware/System Requirements:
For installing a typical configuration to the local PC

CPU:
  75 MHz Pentium-compatible CPU or higher for
  MS Windows 95 or later OS or
  MS NT Workstation OS ver. 4.0; Service Pack 3 or later
  300 MHz Pentium-compatible CPU or higher recommended

RAM:
  16 MB RAM for Windows 95/98 OS and 32 MB for Windows NT Workstations
  Additional 4 MB RAM for each application running simultaneously, except:
    8 MB required for Outlook
    8 MB required for Access
  Recommended total 256 MB to accommodate Office suite, other
  applications, toolbar, etc.

Disk Space:
  217 MB Word, Excel, PowerPoint, Outlook, Access
  174 MB Publisher, Small Business Tools
  CD-ROM required for non-network installations
  Network card required for network installations


Applications: Office 2000 Standard
(includes: Word, Excel, Outlook, PowerPoint)

Hardware/System Requirements:
For installing a typical configuration to the local PC

CPU:
  75 MHz Pentium-compatible CPU or higher for
  MS Windows 95 or later OS or
  MS NT Workstation OS ver. 4.0; Service Pack 3 or later
  300 MHz Pentium-compatible CPU or higher recommended

RAM:
  16 MB RAM for Windows 95/98 OS and 32 MB for Windows NT Workstations
  Additional 4 MB RAM for each application running simultaneously, except:
    8 MB required for Outlook
  Recommended total 256 MB to accommodate Office suite, other
  applications, toolbar, etc.

Disk Space:
  189 MB Word, Excel, PowerPoint, Outlook
  CD-ROM required for non-network installations
  Network card required for network installations

For more detailed information, visit the Microsoft web site:
www.microsoft.com/office/sysreq.htm
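
The Office 2000 RAM rule above is simple arithmetic: a base amount for the operating system plus a per-application amount for everything running at once. The Python sketch below works it through using the Office 2000 Premium figures from the table; the function and dictionary names are hypothetical, and the result is the stated minimum, not the 256 MB recommended total.

```python
BASE_RAM_MB = {"Windows 95/98": 16, "Windows NT Workstation": 32}

# Applications that need more than the default 4 MB (Premium figures).
APP_RAM_MB = {"Outlook": 8, "Access": 8, "FrontPage": 8, "PhotoDraw": 16}
DEFAULT_APP_RAM_MB = 4


def office_2000_min_ram_mb(os_name: str, running_apps: list) -> int:
    """Minimum RAM in MB for the given OS and simultaneously running apps."""
    extra = sum(APP_RAM_MB.get(app, DEFAULT_APP_RAM_MB) for app in running_apps)
    return BASE_RAM_MB[os_name] + extra


# Word and Excel (4 MB each) plus Outlook (8 MB) on Windows 95/98.
example = office_2000_min_ram_mb("Windows 95/98", ["Word", "Excel", "Outlook"])
```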
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^Windows Server System 

Product Information

MOM 2005 System Requirements

Updated: August 25, 2004

Find the recommended hardware and software requirements for MOM 2005 and MOM 2005 Workgroup Edition.
The page includes requirements for running a MOM server, database, consoles, and managed computers.

On This Page

* System Requirements for each MOM Server
* Database Server Requirements
* Administrator and Operator Console Requirements
* Managed Computer Requirements for each Computer
* MOM Reporting Server Requirements
* SQL Server 2000 Reporting Services Server Requirements



System Requirements for each MOM Server

Processor
  MOM 2005: PC with 550 MHz or higher Pentium-compatible
  MOM 2005 Workgroup Edition: PC with 550 MHz or higher Pentium-compatible

Operating system
  MOM 2005: Any of the following:
    * Microsoft Windows Server 2003, Standard Edition with the latest service pack
    * Microsoft Windows Server 2003, Enterprise Edition with the latest service pack
    * Microsoft Windows Server 2003, Datacenter Edition with the latest service pack
    * Microsoft Windows 2000 Server with the latest service pack
    * Microsoft Windows 2000 Advanced Server with the latest service pack
    * Microsoft Windows 2000 Datacenter Server with the latest service pack
  MOM 2005 Workgroup Edition: Any of the following:
    * Microsoft Windows Server 2003, Standard Edition with the latest service pack
    * Microsoft Windows Server 2003, Enterprise Edition with the latest service pack
    * Microsoft Windows Server 2003, Datacenter Edition with the latest service pack

Database software
  MOM 2005: N/A
  MOM 2005 Workgroup Edition: Any of the following:
    * Microsoft SQL Server 2000 Desktop Engine (MSDE)
    * Microsoft SQL Server 2000 Standard Edition
    * Microsoft SQL Server 2000 Enterprise Edition

Memory
  MOM 2005: 512 megabytes (MB)
  MOM 2005 Workgroup Edition: 512 MB

Hard disk
  MOM 2005: 5 GB
  MOM 2005 Workgroup Edition: 5 GB

Hardware (both editions)
    * CD-ROM drive
    * Network adapter
    * Microsoft Mouse or compatible pointing device



Database Server Requirements

Processor
  MOM 2005: PC with 550 MHz or higher Pentium-compatible
  MOM 2005 Workgroup Edition: N/A

Operating system
  MOM 2005: Any of the following:
    * Microsoft Windows Server 2003, Standard Edition with the latest service pack
    * Microsoft Windows Server 2003, Enterprise Edition with the latest service pack
    * Microsoft Windows Server 2003, Datacenter Edition with the latest service pack
    * Microsoft Windows 2000 Server with the latest service pack
    * Microsoft Windows 2000 Advanced Server with the latest service pack
    * Microsoft Windows 2000 Datacenter Server with the latest service pack
  MOM 2005 Workgroup Edition: N/A

Memory
  MOM 2005: 512 MB
  MOM 2005 Workgroup Edition: N/A

Hard disk
  MOM 2005: 5 GB
  MOM 2005 Workgroup Edition: N/A

Database
  Any of the following:
    * Microsoft SQL Server 2000 Standard Edition
    * Microsoft SQL Server 2000 Enterprise Edition

Hardware
    * CD-ROM drive
    * Network adapter
    * Microsoft Mouse or compatible pointing device



Administrator and Operator Console Requirements

Processor
  MOM 2005: PC with 500 MHz or higher Pentium-compatible
  MOM 2005 Workgroup Edition: PC with 500 MHz or higher Pentium-compatible

Operating system
  MOM 2005: Any of the following:
    * Microsoft Windows XP Professional with the latest service pack
    * Microsoft Windows Server 2003, Standard Edition with the latest service pack
    * Microsoft Windows Server 2003, Enterprise Edition with the latest service pack
    * Microsoft Windows Server 2003, Datacenter Edition with the latest service pack
    * Microsoft Windows 2000 Server with the latest service pack
    * Microsoft Windows 2000 Professional with the latest service pack
    * Microsoft Windows 2000 Advanced Server with the latest service pack
    * Microsoft Windows 2000 Datacenter Server with the latest service pack
  MOM 2005 Workgroup Edition: Any of the following:
    * Microsoft Windows XP Professional with the latest service pack
    * Microsoft Windows Server 2003, Standard Edition with the latest service pack
    * Microsoft Windows Server 2003, Enterprise Edition with the latest service pack
    * Microsoft Windows Server 2003, Datacenter Edition with the latest service pack

Memory
  MOM 2005: 128 MB (minimum)
  MOM 2005 Workgroup Edition: 128 MB (minimum)

Hard disk
  MOM 2005: 150 MB
  MOM 2005 Workgroup Edition: 150 MB

Monitor resolution
  MOM 2005: 1024x768 or higher
  MOM 2005 Workgroup Edition: 1024x768 or higher

Software
  MOM 2005: Microsoft .NET Framework 1.1 or later
  MOM 2005 Workgroup Edition: Microsoft .NET Framework 1.1 or later

Hardware (both editions)
    * Network adapter
    * Microsoft Mouse or compatible pointing device



Managed Computer Requirements for each Computer

Processor
  MOM 2005: PC with 200 MHz or higher Pentium-compatible
  MOM 2005 Workgroup Edition: PC with 200 MHz or higher Pentium-compatible

Operating system
  MOM 2005: Any of the following:
    * Microsoft Windows XP Professional with the latest service pack
    * Microsoft Windows Server 2003, Standard Edition with the latest service pack
    * Microsoft Windows Server 2003, Enterprise Edition with the latest service pack
    * Microsoft Windows Server 2003, Datacenter Edition with the latest service pack
    * Microsoft Windows Server 2003, Web Edition with the latest service pack
    * Microsoft Windows Small Business Server 2003 with the latest service pack
    * Microsoft Windows 2000 Server with the latest service pack
    * Microsoft Windows 2000 Professional with the latest service pack
    * Microsoft Windows 2000 Advanced Server with the latest service pack
    * Microsoft Windows 2000 Datacenter Server with the latest service pack
    * Microsoft Windows NT 4.0 Server with the latest service pack
      (agent-less monitoring only)
    * Microsoft Windows NT 4.0 Server Enterprise Edition with the latest
      service pack (agent-less monitoring only)
    * Microsoft Windows NT 4.0 Server Terminal Server Edition with the latest
      service pack (agent-less monitoring only)
  MOM 2005 Workgroup Edition: Any of the following:
    * Microsoft Windows XP Professional with the latest service pack
    * Microsoft Windows Server 2003, Standard Edition with the latest service pack
    * Microsoft Windows Server 2003, Enterprise Edition with the latest service pack
    * Microsoft Windows Server 2003, Datacenter Edition with the latest service pack
    * Microsoft Windows Server 2003, Web Edition with the latest service pack
    * Microsoft Windows Small Business Server 2003 with the latest service pack
    * Microsoft Windows 2000 Server with the latest service pack
    * Microsoft Windows 2000 Professional with the latest service pack
    * Microsoft Windows 2000 Advanced Server with the latest service pack
      (agent-less monitoring only)
    * Microsoft Windows 2000 Datacenter Server with the latest service pack
      (agent-less monitoring only)
    * Microsoft Windows NT 4.0 Server with the latest service pack
      (agent-less monitoring only)
    * Microsoft Windows NT 4.0 Server Enterprise Edition with the latest
      service pack (agent-less monitoring only)
    * Microsoft Windows NT 4.0 Server Terminal Server Edition with the latest
      service pack (agent-less monitoring only)

Memory
  MOM 2005: 128 MB (minimum)
  MOM 2005 Workgroup Edition: 128 MB (minimum)

Hard disk
  MOM 2005: 100 MB
  MOM 2005 Workgroup Edition: 100 MB


MOM Reporting Server Requirements

Processor
  MOM 2005: PC with 550 MHz or higher Pentium-compatible
  MOM 2005 Workgroup Edition: N/A

Operating System
  MOM 2005: Any of the following:
    * Microsoft Windows Server 2003, Standard Edition with the latest service pack
    * Microsoft Windows Server 2003, Enterprise Edition with the latest service pack
    * Microsoft Windows Server 2003, Datacenter Edition with the latest service pack
    * Microsoft Windows 2000 Server with the latest service pack
    * Microsoft Windows 2000 Advanced Server with the latest service pack
    * Microsoft Windows 2000 Datacenter Server with the latest service pack
  MOM 2005 Workgroup Edition: N/A

Database software
  MOM 2005: Any of the following:
    * Microsoft SQL Server 2000 Standard with Service Pack 3.0a or later
    * Microsoft SQL Server 2000 Enterprise Edition with Service Pack 3.0a or later
  MOM 2005 Workgroup Edition: N/A

Memory
  MOM 2005: 512 MB of RAM (1 GB or higher recommended)
  MOM 2005 Workgroup Edition: N/A

Hard disk
  MOM 2005: 200 GB of available hard disk space
  MOM 2005 Workgroup Edition: N/A

Hardware
  MOM 2005: Any of the following:
    * CD-ROM drive or DVD-ROM drive
    * Keyboard and mouse or compatible pointing device, or hardware that
      supports console redirection
    * Network adapter
  MOM 2005 Workgroup Edition: N/A



SQL Server 2000 Reporting Services Server Requirements

Note: MOM 2005 Reporting utilizes SQL Server 2000 Reporting Services. You will need to install and
configure SQL Server 2000 Reporting Services to view MOM 2005 Reports.

Processor
  MOM 2005: PC with 550 MHz or higher Pentium-compatible
  MOM 2005 Workgroup Edition: N/A

Operating System
  MOM 2005: Any of the following:
    * Microsoft Windows Server 2003, Standard Edition with the latest service pack
    * Microsoft Windows Server 2003, Enterprise Edition with the latest service pack
    * Microsoft Windows Server 2003, Datacenter Edition with the latest service pack
    * Microsoft Windows 2000 Server with the latest service pack
    * Microsoft Windows 2000 Advanced Server with the latest service pack
    * Microsoft Windows 2000 Datacenter Server with the latest service pack
  MOM 2005 Workgroup Edition: N/A

Database software
  MOM 2005: Any of the following:
    * Microsoft SQL Server 2000 Standard with Service Pack 3.0a or later
    * Microsoft SQL Server 2000 Enterprise Edition with Service Pack 3.0a or later
  MOM 2005 Workgroup Edition: N/A

Other Software
  MOM 2005: Any of the following:
    * Internet Information Services (IIS) Server 6.0 must be installed as
      part of the Windows Server installation
    * Microsoft SQL Server 2000 Reporting Services can render reports in
      HTML 3.2 and HTML 4.0. To view MOM reports you must have one of the
      following browsers:
        Microsoft Internet Explorer 6.0 with Service Pack 1
        Microsoft Internet Explorer 5.5 with Service Pack 2
        Microsoft Internet Explorer 5.01 with Service Pack 2
        Netscape 7.0
        Netscape 4.78
    * Microsoft Visual Studio .NET 2003, or Integrated Developer Environment
      2003 (if you want to customize or create reports)
  MOM 2005 Workgroup Edition: N/A

Memory
  MOM 2005: 256 MB of RAM (1 GB or higher recommended)
  MOM 2005 Workgroup Edition: N/A

Hard disk
  MOM 2005: 10 GB of available hard disk space
  MOM 2005 Workgroup Edition: N/A
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I Hardware 



Any of the following:

* CD-ROM drive or DVD-ROM drive
* Keyboard and mouse or compatible pointing device, or hardware that
  supports console redirection
* Network adapter


N/A


ARITHMETIC UNIT

FIELD OF THE INVENTION

This invention relates to an arithmetic unit for use in a computer system.

BACKGROUND TO THE INVENTION

Arithmetic units are those which carry out an arithmetic operation in
response to execution of an arithmetic instruction. Such instructions
include an add instruction, a multiply instruction, a divide instruction
and a subtract instruction. It is common to have in a computer system a
so-called ALU (arithmetic logic unit) which is capable of implementing any
one of these arithmetic instructions.

There are frequent occasions when it is required to multiply together pairs
of objects and to add together the resulting products. This is presently
done by effecting a multiplication operation on each pair of objects,
storing the results of the multiplication operations in a register file and
subsequently executing an addition instruction which recalls the earlier
generated results from the register file and adds them together, finally
loading the result back to the register file. One problem with this
arrangement is that the length of the words resulting from the
multiplication operations are much longer than the original operands. For
example, the multiplication together of two 16 bit objects will result in a
32 bit word. It is therefore necessary to have available register capacity
to store these results. One way round this problem which is currently used
is to introduce rounding to reduce the word lengths prior to storage. This
however can introduce rounding errors and can result in the multiplication
not being carried out to adequate precision.

Another problem is the requirement to execute two instructions, a
multiplication instruction followed by an add instruction. Not only do
these instructions take up space in the instruction sequence stored in
memory but they take time to execute.

SUMMARY OF THE INVENTION

According to one aspect of the present invention there is provided an
arithmetic unit for executing an instruction to multiply together n pairs
of objects and to add together the resulting products, said objects being
represented by sub-strings of respective first and second data strings, the
arithmetic unit comprising:

input buffer means for holding said first and second data strings;

a plurality (n) of multiplication circuits for simultaneously multiplying
together respective pairs of objects, each multiplication unit having a
pair of inputs for receiving respective objects defined by sub-strings of
each of the first and second data strings and providing an output;

addition circuitry connected to receive the outputs of the multiplication
circuits and operable to add together the resulting products of
multiplication of the respective pairs of objects to generate a result; and

output buffer means for holding the result.

This allows a combined multiply-add operation to be carried out in response
to execution of a single instruction. It also has the advantage that the
length of the result will always be less than the length of one of the data
strings. Therefore, it can be ensured that the length of the result will
not exceed the available capacity of the register for storing the result.
This is particularly useful in a packed arithmetic environment, where an
operand comprises a plurality of packed objects and the intention is to
carry out the same arithmetic operation on respective pairs of objects in
different operands. The first and second data strings can constitute a
single operand or can be separate operands.

In the described embodiment, the addition circuitry comprises a first set
of adder circuits each arranged to add together the outputs of a distinct
two of said n multiplication circuits, and a further adder circuit arranged
to add together outputs of each of the first set of adder circuits to
provide the result.

In the described embodiment, the input buffer means comprises first and
second input buffers each arranged to hold a respective one of the first
and second data strings. However, only one input buffer is needed in the
case where pairs of objects within an operand are to be multiplied
together. It will readily be apparent that it is an advantage of the
invention that the output buffer means can have a capacity which is less
than the input buffer means.

The invention also provides a computer comprising a processor, memory and
data storage circuitry for holding data strings each comprising n
sub-strings representing respective objects, wherein said processor
comprises an arithmetic unit as defined above and wherein there is stored
in said memory a sequence of instructions comprising at least an
instruction to multiply together n pairs of objects and to add together the
resulting products, said instruction being executed by the arithmetic unit.

The data storage circuitry can comprise a plurality of register stores each
having a predetermined bit capacity matching the length of each of the data
strings.

One particularly useful application of the combined multiply-add
instruction described herein is to multiply a vector by a matrix. Thus, the
invention further provides a method of operating a computer to multiply a
vector by a matrix wherein the vector is represented by a vector data
string comprising a plurality of sub-strings each defining vector elements
and wherein the matrix is represented by a set of matrix data strings each
comprising a plurality of sub-strings defining matrix elements, the method
comprising selecting each of said matrix data strings in turn and executing
for each matrix data string a single instruction which:

loads the selected data string into an input buffer means of the
arithmetic unit;

loads said vector data string into said input buffer means;

simultaneously multiplies respective pairs of vector elements and matrix
elements of said data strings to generate respective products;

adds together said products; and

generates a result.

For a better understanding of the present invention and to show how the
same may be carried into effect, reference will now be made by way of
example to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a processor and memory of a computer;

FIG. 2 is a block diagram of a packed arithmetic unit;

FIG. 3 shows the meaning of symbols used in the figures;

FIG. 4 is a block diagram of an obvious packed arithmetic unit operating on
two packed source operands;

FIG. 5 is a block diagram of an obvious packed arithmetic unit which
operates on a packed source operand and an unpacked source operand; and

FIG. 6 shows a multiply-add unit.
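
The combined multiply-add summarized above can be modeled in a few lines of software. The Python sketch below is an illustration only: it assumes four unsigned 16-bit objects packed into each 64-bit data string, with the lowest-order sub-string first, and the names pack, unpack, multiply_add, and vector_times_matrix are hypothetical. The patent describes hardware circuits, not this code.

```python
def pack(objects: list, width: int = 16) -> int:
    """Pack unsigned objects into one data string, lowest sub-string first."""
    value = 0
    for i, obj in enumerate(objects):
        value |= (obj & ((1 << width) - 1)) << (i * width)
    return value


def unpack(data_string: int, n: int = 4, width: int = 16) -> list:
    """Split a data string into n unsigned sub-strings of `width` bits."""
    mask = (1 << width) - 1
    return [(data_string >> (i * width)) & mask for i in range(n)]


def multiply_add(first: int, second: int, n: int = 4, width: int = 16) -> int:
    """Multiply n pairs of packed objects and add the products together."""
    pairs = zip(unpack(first, n, width), unpack(second, n, width))
    return sum(a * b for a, b in pairs)


def vector_times_matrix(vector: int, matrix_strings: list,
                        n: int = 4, width: int = 16) -> list:
    """Apply one multiply-add per matrix data string, as in the method above."""
    return [multiply_add(row, vector, n, width) for row in matrix_strings]


dot = multiply_add(pack([1, 2, 3, 4]), pack([5, 6, 7, 8]))
```

Under these assumptions each product fits in 32 bits and the sum of four products fits in 34 bits, which is consistent with the statement that the result is shorter than one 64-bit data string.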
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DESCRIPTION OF THE PREFERRED
EMBODIMENT

FIG. 1 shows a processor in accordance with one embodiment
of the present invention. The processor has three
execution units including a conventional arithmetic unit 2
and a memory access unit 4. In addition there is a packed
arithmetic unit 6. The processor also includes an instruction
fetcher 8, an instruction register 10, a register file 12 and an
instruction pointer 14 all of which operate under the control
of a control unit 16 of the processor. The register file
comprises a set of registers each having a predetermined bit
capacity and each being addressable with a single address.
It is not possible to address individual locations within a
register. When a register is accessed, the entire contents of
the register are concerned. The processor further includes a
constant unit 18 and a select unit 20. The constant unit 18
and select unit 20 are also operated under the control of the
control unit 16. The processor operates in conjunction with
a memory 22 which holds instructions and data values for
effecting operations of the processor. Data values and
instructions are supplied to and from the memory 22 via a
data bus 24. The data bus 24 supplies data values to and
from the memory 22 via a memory data input 26. The data
bus 24 also supplies data to the instruction fetcher 8 via a
fetcher data input 28 and to the memory access unit 4 via a
memory access read input 30. The memory is addressed via
the select unit 20 on address input 32. The select unit 20 is
controlled via a fetch signal 34 from the control unit 16 to
select an address 36 from the fetcher 8 or an address 38 from
the memory access unit 4. Read and write control lines 40,42
from the control unit 16 control read and write operations
to and from the memory 22. The instruction fetcher 8 fetches
instructions from the memory 22 under the control of the
control unit 16 as follows. An address 36 from which
instructions are to be read is provided to the memory 22 via
the select unit 20. These instructions are provided via the
data bus 24 to the fetcher data input 28. When the instruction
fetcher has fetched its next instruction, or in any event has
a next instruction ready, it issues a Ready signal on line 44
to the control unit 16. The instruction which is to be
executed is supplied to the instruction register 10 along
instruction line Inst 46 and held there during its execution.
The instruction pointer 14 holds the address of the instruction
being executed supplied to it from the fetcher 8 via
instruction pointer line 48. A Get signal 47 responsive to a
New Inst signal 53 from the control unit 16 causes the
instruction register 10 to store the next instruction on Inst
line 46 and causes the fetcher 8 to prepare the next instruction.
The New Inst signal 53 also causes the instruction
pointer 14 to store the address of the next instruction. A
branch line 50 from the control unit 16 allows the instruction
fetcher 8 to execute branches.

The instruction register 10 provides Source 1 and Source
2 register addresses to the register file 12 as Reg1 and Reg2.
A result register address is provided as Dest. Opcode is
provided to the control unit 16 along line 51. In addition,
some instructions will provide a constant operand instead of
encoding one or both source registers. The constant is
provided by the constant unit 18. The instruction's source
values are provided on Source 1 and Source 2 busses 52,54
by the appropriate settings of the S1 Reg and S2 Reg signals
at inputs E1,E2. The correct execution unit is enabled by
providing the appropriate values for Pack Ops, Mem Ops
and ALU Ops signals from the control unit 16 in accordance
with the Opcode on line 51. The enabled unit will normally
provide a result Res on a result bus 56. This is normally
stored in the selected result register Dest in the register file
12. There are some exceptions to this.

Some instructions provide a double length result. These
store the first part of the result in the normal way. In a
subsequent additional stage, the second part of the result is
stored in the next register in the register file 12 by asserting
a Double signal 58.

Branches 50 need to read and adjust the instruction
pointer 14. These cause the S1 Reg signal not to be asserted,
and so the instruction pointer 14 provides the Source 1 value
on line 60. The Source 2 value is provided in the normal way
(either from a register in the register file 12, or the constant
unit 18). The arithmetic unit 2 executes the branch calculations
and its result is stored into the fetcher 8 on the New IP
input 64, rather than the register file 12, signalled by the
Branch line 50 from the control unit 16. This starts the
fetcher from a new address.

Conditional branches must execute in two stages depending
on the state of condition line 62. The first stage uses the
Dest register as another source, by asserting a Read Dest
signal 45. If the condition is satisfied, then the normal branch
source operands are read and a branch is executed.

Calls must save a return address. This is done by storing
the instruction pointer value in a destination register prior to
calculating the branch target.

The computer described herein has several important
qualities.

Source operands are always the natural word length.
There can be one, two or three source operands.

The result is always the natural word length, or twice the
natural word length. There is a performance penalty when it
is twice the natural word length as it takes an extra stage to
store and occupies two, rather than one, registers. For this
computer, assume a natural word length of 64 bits. That is,
each register in the register file has a predetermined capacity
of 64 bits.

The execution units 2,4,6 do not hold any state between
instruction execution. Thus subsequent instructions are
independent.

Non-Packed Instructions

The arithmetic unit 2 and memory access unit 4, along
with the control unit 16 can execute the following instructions
of a conventional instruction set. In the following
definitions, a register is used to denote the contents of a
register as well as a register itself as a storage location, in a
manner familiar to a person skilled in the art.

mov     Move a constant or a register into a register.
add     Add two registers together and store the
        result in a third register (which could be the
        same as either of the sources)
sub     Subtract two registers and store the result
        in a third register
load    Use one register as an address and read from
        that location in memory, storing the result
        into another register
store   Use one register as an address and store the
        contents of another register into memory at
        the location specified by the address
cmpe    Compare two registers (or a register and a
        constant) for equality. If they are equal,
        store 1 into the destination register
cmpge   Compare two registers (or a register and a
        constant) for orderability. If the second is
        not less than the first, store 1 into the
        destination register
jumpz   Jump to a new program location, if the
        contents of a specified register is zero
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-continued 



jumpnz  Jump to a new program location, if the
        contents of a specified register is not zero
shr     Perform a bitwise right shift of a register by
        a constant or another register and store the
        result in a destination register. The shift
        is signed because the sign bit is duplicated
        when shifting.
shl     Perform a bitwise left shift of a register by
        a constant or another register and store the
        result in a destination register
or/xor  Perform a bit-wise logical operation (or/xor)
        on two registers and store result in
        destination register.



are stored in a result buffer 102. The result word thus
contains four packed objects. An enable unit 101 determines
if any of the unit should be active and controls whether the
output buffer asserts its output.
The instructions are named as follows: 



add2p

sub2p



Packed Unit 

FIG. 2 shows in a block diagram the packed arithmetic
unit 6. This is shown as a collection of separate units each
responsible for some subset of packed arithmetic instructions.
It is quite probable that another implementation could
combine the functions in different ways. The units include a
byte replicate unit 70, a twist and zip unit 74, an obvious
packed arithmetic unit 80, a multiply-add unit 76 and other
packed arithmetic units 72,78. Only the multiply-add unit
and obvious packed arithmetic unit are described in detail
herein. These are operated responsive to a route opcode unit
82 which selectively controls the arithmetic units 70 to 80.
Operands for the arithmetic units 70 to 80 are supplied along
the Source 1 and Source 2 busses 52,54. Results from the
arithmetic units are supplied to the result bus 56. The op
input to the route opcode unit 82 receives the Pack Ops
instruction from the control unit 16 (FIG. 1). It will be
appreciated that the operands supplied on the Source 1 and
Source 2 busses are loaded into respective input buffers of
the arithmetic units and the results supplied from one or two
output buffers to one or two destination registers in the
register file 12.
Obvious Packed Arithmetic 

The obvious packed arithmetic unit 80 performs operations
taking the two source operands as containing several
packed objects each and operating on respective pairs of
objects in the two operands to produce a result also containing
the same number of packed objects as each source.
The operations supported can be addition, subtraction,
comparison, multiplication, left shift, right shift etc. As
explained above, by addressing a register using a single
address an operand will be accessed. The operand comprises
a plurality of objects which cannot be individually
addressed.

FIG. 3 shows the symbols used in the diagrams illustrat- 
ing the arithmetic units of the packed arithmetic unit 6. 

FIG. 4 shows an obvious packed arithmetic unit which
can perform addition, subtraction, comparison and multiplication
of packed 16 bit numbers. As, in this case, the source
and result bus widths are 64 bit, there are four packed
objects, each 16 bits long, on each bus.
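
As a rough illustration of the lane structure just described, the following C sketch models a packed add of four 16 bit objects held in one 64 bit word. It is not taken from the patent: the function name add2p_model and the placement of object 0 in the least significant 16 bits are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of a packed 16 bit add.
   Assumption: object i occupies bits 16*i .. 16*i+15 of the word. */
uint64_t add2p_model(uint64_t s1, uint64_t s2)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t a = (uint16_t)(s1 >> (16 * i));
        uint16_t b = (uint16_t)(s2 >> (16 * i));
        /* Each lane wraps on its own; no carry reaches the next lane. */
        uint16_t sum = (uint16_t)(a + b);
        r |= (uint64_t)sum << (16 * i);
    }
    return r;
}
```

Adding 1 to a lane holding 0xFFFF leaves that lane at zero and the neighbouring lane unchanged, which is the "overflow is ignored" behaviour of a packed add.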

The obvious packed arithmetic unit 80 comprises four
arithmetic logical units ALU0-ALU3, each of which is
controlled by opcode on line 100 which is derived from the
route opcode unit 82 in FIG. 3. The 64 bit word supplied
from source register 1 SRC1 contains four packed objects



Add each respective S1[i] to S2[i] as 2's
complement numbers producing R[i]. Overflow
is ignored.

Subtract each respective S2[i] from S1[i] as
2's complement numbers producing R[i].
Overflow is ignored.

Compare each respective S1[i] with S2[i]. If
they are equal, set R[i] to all ones; if they
are different, set R[i] to zero.

Compare each respective S1[i] with S2[i] as
signed 2's complement numbers. If S1[i] is
greater than or equal to S2[i] set R[i] to all
ones; if S1[i] is less than S2[i] set R[i] to
zero.

Multiply each respective S1[i] by S2[i] as
signed 2's complement numbers setting R[i] to
the least significant 16 bits of the full (32



Some obvious packed arithmetic instructions naturally
take one packed source operand and one unpacked source
operand. FIG. 5 shows such a unit.

The contents of the packed arithmetic unit of FIG. 5 are
substantially the same as that of FIG. 4. The only difference
is that the input buffer 92' for the second source operand
receives the source operand in unpacked form. The input
buffer 90' receives the first source operand in packed form as
before. One example of instructions using an unpacked
source operand and a packed source operand are shift
instructions, where the amount to shift by is not packed, so
that the same shift can be applied to all the packed objects.
Whilst it is not necessary for the shift amount to be
unpacked, this is more useful.



shl2p   Shift each respective S1[i] left by S2 (which
        is not packed), setting R[i] to the result.

shr2ps  Shift each respective S1[i] right by S2 (which
        is not packed), setting R[i] to the result.
        The shift is signed, because the sign bit is
        duplicated when shifting.
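
The two shift instructions can be sketched in C as follows. This is only an illustrative model: the names shl2p_model and shr2ps_model, the lane order (object 0 in the low 16 bits) and the restriction of the shift amount to 0..15 are assumptions, not details stated in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Packed left shift: the same unpacked amount s2 is applied to every lane. */
uint64_t shl2p_model(uint64_t s1, unsigned s2)
{
    assert(s2 < 16);                    /* assumed valid range */
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t lane = (uint16_t)(s1 >> (16 * i));
        uint16_t out = (uint16_t)(lane << s2);   /* bits shifted out are lost */
        r |= (uint64_t)out << (16 * i);
    }
    return r;
}

/* Packed signed right shift: the sign bit of each lane is duplicated. */
uint64_t shr2ps_model(uint64_t s1, unsigned s2)
{
    assert(s2 < 16);                    /* assumed valid range */
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t lane = (uint16_t)(s1 >> (16 * i));
        /* Interpret the lane as a signed 2's complement value. */
        int32_t v = (lane & 0x8000u) ? (int32_t)lane - 0x10000 : (int32_t)lane;
        /* Portable arithmetic shift: avoid shifting a negative value. */
        int32_t sh = (v < 0) ? ~((~v) >> s2) : (v >> s2);
        r |= (uint64_t)(uint16_t)sh << (16 * i);
    }
    return r;
}
```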



It is assumed that the same set of operations are provided
for packed 8 bit and packed 32 bit objects. The instructions
have similar names, but replacing the "2" with a "1" or a "4".
S1[0]-S1[3]. The 64 bit word supplied from source register
2 SRC2 contains four packed objects S2[0]-S2[3]. These are
stored in first and second input buffers 90,92. The first
arithmetic logic unit ALU0 operates on the first packed
object in each operand, S1[0] and S2[0] to generate a result
R[0]. The second to fourth arithmetic logic units
ALU1-ALU3 similarly take the second to fourth pairs of
objects and provide respective results R[1] to R[3]. These

Multiply-Add Unit
FIG. 6 shows the multiply-add unit 76. The multiply-add
unit comprises two input buffers 104,106 which receive
respective operands marked SRC1 and SRC2. In the illustrated
embodiment, each operand comprises four packed 16
bit objects S1[0] to S1[3], S2[0] to S2[3]. A first multiplication
circuit 108 receives the first object S1[0] from the first
input buffer and the first object S2[0] in the second input
buffer and multiplies them together to generate a first
multiplication result. A second multiplication circuit 110
receives the second object S1[1] from the first buffer and the
second object S2[1] from the second buffer and multiplies
them together to generate a second multiplication result. A
third multiplication circuit 112 receives the third objects
S1[2],S2[2] from the first and second buffers and multiplies
them together to generate a third multiplication result. A
fourth multiplication circuit 114 receives the fourth objects
S1[3],S2[3] from each buffer and multiplies them together to
generate a fourth multiplication result. It will readily be
appreciated that the multiplication circuits can take any
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suitable form, well known to a person skilled in the art. The
first multiplication result and second multiplication result
are supplied to respective inputs of a first adder circuit 116.
The third and fourth multiplication results are supplied to
respective inputs of a second adder circuit 118. Each of the
first and second adder circuits 116,118 add together their
respective inputs and supply the results to the input of a third
adder circuit 120. The output of that adder circuit is held in
an output buffer 122.

It will be appreciated that the multiplication operation
carried out by the individual multiplication circuits will
generate results having a "double length". That is, the
multiplication together of two 16 bit objects will result in a
32 bit word. The addition of two 32 bit words will result in
a word which has one or two bits more than 32 bits. This
means that the capacity of the result buffer can safely be less
than the capacity of one of the input buffers.
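
The width argument above can be checked with a small C model of the FIG. 6 data path. This is a sketch under stated assumptions, not the patent's circuit: the names lane16s and muladd2ps_model are invented for the example, and object 0 is assumed to sit in the least significant 16 bits of each operand.

```c
#include <assert.h>
#include <stdint.h>

/* Extract lane i of a 64 bit word as a signed 16 bit value (assumed layout). */
static int32_t lane16s(uint64_t w, int i)
{
    uint16_t u = (uint16_t)(w >> (16 * i));
    return (u & 0x8000u) ? (int32_t)u - 0x10000 : (int32_t)u;
}

/* Four 16x16 signed multiplies followed by an adder tree:
   (p0 + p1) + (p2 + p3), mirroring adders 116, 118 and 120. */
int64_t muladd2ps_model(uint64_t src1, uint64_t src2)
{
    int64_t p[4];
    for (int i = 0; i < 4; i++)
        p[i] = (int64_t)lane16s(src1, i) * lane16s(src2, i);  /* fits in 32 bits */

    int64_t first = p[0] + p[1];     /* at most one extra bit  */
    int64_t second = p[2] + p[3];
    int64_t result = first + second; /* at most two extra bits */

    /* The sum always fits in a 34 bit signed field, well under 64 bits. */
    assert(result >= -(INT64_C(1) << 33) && result < (INT64_C(1) << 33));
    return result;
}
```

The largest magnitude occurs when every object is -32768: each product is 2^30 and the sum is 2^32, which needs 34 bits as a signed value and so leaves ample room in a 64 bit result buffer.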

A type unit 124 receives opcode on line 118 derived from
the route opcode unit 82 in FIG. 3. The type unit controls the
output buffer 122.

The multiply-add unit is thus capable of executing a
single instruction of the following form:



Multiply and add the packed 16-bit signed 2's
complement objects.



represented by a vector operand comprising a plurality of
objects each defined in vector elements. The matrix is
represented by a set of matrix operands, each comprising a
plurality of objects defining matrix elements. Each matrix
operand is taken in turn and loaded into one of the input
buffers of the unit of FIG. 6. The vector operand is loaded
into the other input buffer. The multiply-add unit therefore
multiplies respective pairs of vector elements and matrix
elements to generate respective products and adds together
the products. The result is held in the result buffer 122.

An exemplary instruction sequence for multiplying a
vector by a matrix is shown in Annex A(ii).
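
The vector-by-matrix use can be pictured with the C sketch below, in which each matrix operand of four elements is combined with the vector operand by one multiply-add step. It is an illustration only and does not reproduce the annex; the function name matvec4_model and the use of plain int16_t arrays instead of packed registers are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One multiply-add step: four pairwise products summed into one result. */
static int64_t muladd_row(const int16_t row[4], const int16_t v[4])
{
    int64_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (int64_t)row[i] * v[i];
    return sum;
}

/* Take each matrix operand in turn and combine it with the vector operand. */
void matvec4_model(const int16_t m[][4], size_t rows,
                   const int16_t v[4], int64_t out[])
{
    assert(m != NULL && v != NULL && out != NULL);
    for (size_t r = 0; r < rows; r++)
        out[r] = muladd_row(m[r], v);
}
```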



Annex A(i)

;sum of products of two vectors of N 16 bit objects
;JU points to the first vector (A)
;SI points to the second vector (B)



The result of execution of that instruction will be to
multiply together respective pairs of objects from two operands
and to add together the results to provide a final result
remaining within the original width of each operand. This
allows the multiplication step to be carried out without
incurring rounding errors, which normally happen as a result
of multiplication steps to keep the word length within limits
determined by the capacity of the available registers. The
present multiply-add unit thus allows the multiplication to
be performed at a high precision. Moreover, there is no need
to incur rounding errors in the addition, because the length
of the final result will inevitably be less than the capacity of
one of the input buffers. As described earlier, the capacity of
the input buffer will match the capacity of the available
registers in the register file. Therefore, on execution of this
instruction it can receive two operands, each occupying a
single register and can guarantee that the result will occupy
no more than one register. Conversely, because the capacity
of the available register for the result is likely to be 64 bits,
then it is certainly large enough to take the complete result
and therefore prevent overflow errors from occurring.

It will readily be appreciated that it is possible to design
similar multiply-add units for carrying out the combined
multiply-add operation on different sizes of packed objects.
It will also be readily appreciated that it is possible to hold
the objects to be multiplied as part of a single operand in
only one input buffer.

One example of use of the multiply-add unit is to evaluate 
the sum of products. 

Sum of products is the evaluation of the following: 
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In the illustrated example of FIG. 6, N=4, operand SRC1
is A1,A2,A3,A4 and operand SRC2 is B1,B2,B3,B4.

The multiply-add instruction can be used to effect this,
and the sequence of instructions is shown in Annex A(i).
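
For longer vectors, the sum of products A1*B1 + ... + AN*BN can be built up four elements at a time, one multiply-add per group. The C sketch below illustrates that idea under stated assumptions: the names muladd4 and sum_of_products are hypothetical, N is assumed to be a multiple of four, and the annex's actual instruction sequence is not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one multiply-add instruction on four 16 bit objects. */
static int64_t muladd4(const int16_t *a, const int16_t *b)
{
    return (int64_t)a[0] * b[0] + (int64_t)a[1] * b[1]
         + (int64_t)a[2] * b[2] + (int64_t)a[3] * b[3];
}

/* Accumulate the sum of products over n elements, four per step. */
int64_t sum_of_products(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 4 == 0);              /* assumption made for this sketch */
    int64_t total = 0;
    for (size_t i = 0; i < n; i += 4)
        total += muladd4(a + i, b + i);
    return total;
}
```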

Another useful application of the multiply-add unit is to 
effect multiplication of a vector by a matrix. The vector is 



What is claimed is: 

1. An arithmetic unit for executing an instruction to
multiply together pairs of objects from two sets of objects
and to add together the resulting products, at least one set of
said objects being represented by sub-strings forming a data
string being a packed operand, the arithmetic unit comprising:

input buffer constructed and arranged to hold said set of
objects forming said packed operand;

a plurality of multiplication circuits constructed and
arranged to multiply together the respective pairs of
objects, each said multiplication circuit including a pair
of inputs for receiving the respective objects and
including an output;

addition circuitry connected to receive the outputs of the
multiplication circuits and constructed and arranged to
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add together the resulting products of multiplication of 
the respective pairs of objects to generate a result; and 
an output buffer constructed and arranged to hold the
result.

2. An arithmetic unit according to claim 1 wherein the
addition circuitry comprises a first set of adder circuits each
arranged to add together the outputs of a distinct two of said
multiplication circuits, and a further adder circuit arranged
to add together outputs of each of the first set of adder
circuits to provide the result.

3. An arithmetic unit according to claim 1 or 2 wherein the
input buffer comprises first and second input buffers each
arranged to hold a respective one of the two sets of objects.

4. An arithmetic unit according to claim 1 wherein the 
output buffer has a capacity which is less than the input 
buffer. 

5. An arithmetic unit according to claim 1 wherein the
input buffer is constructed and arranged to hold two sets of
objects represented by sub-strings forming two data strings
in form of two packed operands.

6. An arithmetic unit according to claim 1 wherein the
input buffer is constructed and arranged to hold two sets of
objects represented by sub-strings; wherein one set of sub-strings
forms vector elements of a vector operand and the
other set of sub-strings forms matrix elements of a matrix
operand.

7. A computer comprising a processor, memory and data
storage circuitry for holding data strings, wherein said
processor comprises an arithmetic unit including

an input buffer constructed and arranged to hold two sets
of objects, wherein at least one set of objects is represented
by sub-strings of a data string forming said
packed operand;

a plurality of multiplication circuits constructed and
arranged to multiply simultaneously together respective
pairs of objects from said two sets of objects, each
said multiplication circuit including a pair of inputs for
receiving said respective objects and including an output;

addition circuitry connected to receive the outputs of the
multiplication circuits and constructed and arranged to
add together the resulting products of multiplication of
the respective pairs of objects to generate a result; and

an output buffer constructed and arranged to hold the
result, and wherein there is stored in said memory a
sequence of instructions comprising at least an instruction
to multiply together said pairs of objects from said
two sets of objects and to add together the resulting
products, said instruction being executed by the arithmetic
unit.

8. A computer according to claim 7 wherein the data
storage circuitry comprises a plurality of register stores each
having a pre-determined bit capacity matching the length of
each of said data strings and being arranged to store said data
strings as packed operands including objects represented by
sub-strings.






9. A method of operating a computer to multiply together
a vector and a matrix wherein the vector is represented by a
vector data string comprising a plurality of sub-strings
defining vector elements, the vector elements being objects
arranged to form a packed operand, and wherein the matrix
is represented by a set of matrix data strings each comprising
a plurality of sub-strings defining matrix elements, the
matrix elements being objects arranged to form packed
operands, the method comprising selecting each of said
matrix data strings in turn and executing for each selected
data string a single instruction which:

loads the selected data string in form of packed source
objects into an input buffer of an arithmetic unit;

loads said vector data string in form of packed source
objects into said input buffer;

simultaneously multiplies respective pairs of vector ele- 
ments and matrix elements of said data strings to 
generate respective products; 

add together said products; and 

generates a result. 

10. A method of executing an instruction for multiplying 
together pairs of objects from two sets of objects and adding 
together the resulting products, said method including: 

providing two sets of objects to an input buffer, wherein
at least one said set of objects being represented by
sub-strings forming a data string in form of a packed
operand;

supplying said objects to a plurality of multiplication
circuits for multiplying together respective pairs of said
objects from said two sets, each said multiplication
circuit receiving the respective objects defined by said
sub-strings at a pair of inputs and providing an output;

receiving the outputs of the multiplication circuits by 
addition circuitry for summing together the multiplica- 
tion outputs to generate a result; and 

holding in an output buffer the result. 

11. A method according to claim 10 wherein said summing
in the addition circuitry includes summing pairs of
outputs from the multiplication circuits in a first set of adder
circuits and then summing pairs of outputs of each of the
first set of adder circuits to provide the result.

12. A method according to claim 10 wherein both said sets
of objects are represented by sub-strings forming two data
strings in form of two packed operands.

13. A method according to claim 12 wherein one of the
two packed operands is a vector operand including a plurality
of said objects defining vector elements and the other
of the two packed operands is a matrix operand including a
plurality of objects defining matrix elements.
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2.3 The UltraSPARC Front End

The UltraSPARC front end is essentially the Prefetch/Dispatch Unit (PDU). 
Figure 2-2 illustrates the major components of the UltraSPARC-1 front end. 
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Figure 2-2 UltraSPARC-I Front End

Instructions are prefetched from a pseudo 2-way 16 kbyte instruction cache. Each
line in the I-Cache contains 8 instructions (32 bytes). Every pair of instructions
has a 2-bit branch prediction field which maintains history of a possible branch in
the pair. The four prediction states are the conventional strongly taken, likely
taken, strongly not-taken and likely not-taken. The advantage of the in-cache
prediction scheme is that it avoids the alias problems encountered in branch history
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2. UltraSPARC Concepts 

buffer and other similar structures. Every single branch in the I-Cache has its 
dedicated prediction bits (ignoring the rare case of branch couples), which trans- 
lates into a successful prediction rate of 88% for integer code, 94% for floating- 
point (SPEC92) and 90% for typical database applications. 

Every group of four instructions in the cache has a "next field" which is simply a 
pointer to where the prefetcher should access instructions for the very next cycle. 
In the case of sequential code or for code with a branch predicted not-taken, the 
next field points to the next 4 instructions in the cache. The next field will contain 
the I-Cache index (including the set) of the branch target if a branch is predicted
taken. The advantage of this scheme is that the next field can always be fed back
to the I-Cache without qualifying a possible branch. In order to provide a one-cycle
loop back to the I-Cache, a fast dual-ported structure was used to implement
the next field and the branch prediction bits. Only one set of the cache is accessed 
during a fetch, saving power and reducing the cache cycle time. Both tags are 
read so that an incorrect set prediction can be corrected. A two-cycle penalty oc- 
curs for a set misprediction. The next field mechanism allows UltraSPARC to 
speculate 5 branches deep representing up to 18 instructions. 

Instructions prefetched by the PDU are expanded to 76 bits in order to facilitate 
decoding done by the grouping logic. These decoded instructions are forwarded 
to a 12-deep instruction buffer which allows the prefetcher to get ahead of the
execution units. As long as the instruction queue is kept almost full, cache miss, set
miss and micro-TLB (uTLB) miss penalties can be hidden from the execution
units.

A single entry uTLB provides the prefetcher with a local copy of the last
virtual-to-physical address translation. In the rare case of a uTLB miss a 1-cycle fetch
penalty is incurred in order to get the address from the 64-entry fully associative
instruction-TLB (iTLB).

The grouping logic always looks at the next four candidates in the instruction 
buffer and based on resource availability and dependencies, issues up to four in- 
structions. Maintaining more than one Program Counter (PC) per group allows 
UltraSPARC to dispatch, in the same group, instructions from two adjacent basic 
blocks. 

2.3.1 Integer Execution Unit (IEU)

The Integer Execution Unit (IEU) performs integer computation for all integer
arithmetic/logical operations. The IEU as depicted in Figure 2-3 includes

Sun Microsystems, Inc.
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2. UltraSPARC Concepts 

A separate 64-bit adder is provided for virtual address additions for memory
instructions. A simple 64-bit integer multiplier and divider complement the IEU.
The multiplication unit implements a 2-bit Booth encoding algorithm with an
"early-out" mechanism, with a typical latency of 8 clock cycles. A 1-bit non-restoring
subtraction algorithm is used in the divide unit, which yields a latency of
67 clock cycles for a 64-bit by 64-bit division.



2.3.2 Floating Point/Graphics Unit (FGU)

The Floating-Point and Graphics Unit (FGU) as illustrated in Figure 2-4 integrates 
five functional units and a 32 registers by 64 bits Register File. The floating-point 
adder, multiplier and divider perform all FP operations while the graphics adder 
and multiplier perform the graphics operations of the VIS Instruction Set. 
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Figure 2-4 Floating Point and Graphics Unit 






VIS Instruction Set User's Manual

A maximum of two floating-point/graphics operations (FGops) and one FP
load/store operation are executed in every cycle (plus another integer or branch
instruction). All operations, except for divide and square-root, are fully pipelined.
Divide and square-root operations complete out-of-order without inhibiting the
concurrent execution of other FGops. The two graphics units are both fully
pipelined and perform operations on 8 or 16-bit pixel components with 16 or 32-bit
intermediate results.

The Graphics Adder performs single cycle partitioned add and subtract, data
alignment, merge, expand and logical operations. Four 16-bit adders are utilized
and a custom shifter is implemented for byte concatenation and variable
byte-length shifting. The Graphics Multiplier performs three cycle partitioned
multiplication, compare, pack and pixel distance operations. Four 8x16 multipliers are
utilized and a custom shifter is implemented. Eight 8-bit pixel subtractions,
absolute values, additions and a final alignment are required for each pixel distance
operation.

2.3.3 Load/Store Unit (LSU) 

The Load/Store Unit (LSU) executes all instructions that transfer data between 
the memory hierarchy and the Integer and Floating Point/Graphics Register files. 
The LSU includes the Data Cache, Load Buffer, Store Buffer, and is very closely 
coupled to the second level external cache. See Figure 2-5 for a functional dia- 
gram of the Load/Store Unit. 

2.3.3.1 Data Cache

The Data Cache (D-Cache) is a 16kB, direct-mapped cache. It has a 32B (256 bits)
line size, with 16B (128 bits) sub-blocks. It is virtually-indexed and
physically-tagged. The D-Cache is non-blocking and operates using a write-through,
no-write-allocate policy. Strict inclusion with respect to the E-cache is maintained,
facilitating cache coherency. The D-Cache data SRAM is single-ported and can
support a 64-bit load or a 64-bit store every cycle. In the event of a D-Cache miss, an
entire sub-block (16B) can be written in one clock. The D-Cache tag SRAM has
two ports, a read port and a read/write port. These two ports allow a load or store
to perform a tag look-up in parallel with the allocation for an older D-Cache
miss.
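
The geometry stated above (16kB, direct-mapped, 32B lines, 16B sub-blocks) implies 512 lines. The C sketch below works that arithmetic through; it is only an illustration, and the function names and the simple choice of address bits for the line index are assumptions rather than details taken from the manual.

```c
#include <assert.h>
#include <stdint.h>

enum {
    DCACHE_BYTES   = 16 * 1024,  /* 16kB total capacity  */
    LINE_BYTES     = 32,         /* 32B line size        */
    SUBBLOCK_BYTES = 16,         /* 16B sub-blocks       */
    NUM_LINES      = DCACHE_BYTES / LINE_BYTES   /* 512 lines */
};

/* Direct-mapped: each address maps to exactly one line (assumed bit choice). */
unsigned dcache_line_index(uint64_t vaddr)
{
    return (unsigned)((vaddr / LINE_BYTES) % NUM_LINES);
}

/* Which of the two 16B sub-blocks within the line the address falls in. */
unsigned dcache_subblock(uint64_t vaddr)
{
    return (unsigned)((vaddr % LINE_BYTES) / SUBBLOCK_BYTES);
}
```

Two addresses that are 16kB apart map to the same line in this model, which is the usual consequence of a direct-mapped organisation.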

2.3.3.2 Load Buffer 

The load buffer can eliminate stalls caused by D-Cache misses, load-after-store 
hazards, and other conflicts. Nine entries were implemented to cover the addi- 
tional 6-cycle latency of a D-Cache miss/E-Cache hit. A rate of one load E-Cache 

Sun Microelectronics 
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4. Using VIS 
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Figure 4-30 Three Dimensional Array Blocked-Address Format (Array32) 

See the example on page 101, to see how the array8, the load and the add/sub
instructions are used and grouped together for maximum throughput. The grouping
takes into consideration the latencies of the different instructions, i.e. the load,
ldda, following the array8, does not load the voxel just addressed by the array8
in its grouping, but rather the voxel addressed by array8 in the previous
grouping.

The array instructions operate on all 64 bits of an integer register. Solaris 2.5 al- 
lows all 64 bits of the registers %g2-%g4 and %o0-%o7 to be used; other registers 
cannot be relied on to retain their upper 32 bits. Since the current SPARCompiler 
4.x has limited support for 64-bit integer operations, the array instructions might 
not be accessed efficiently from C. For a coding example, see "Using array8 With 
Assembly Code" on page 101. 

4.7.11 vis_pdist()

Function 

Compute the absolute value of the difference between two pixel pairs,
i.e. between eight pairs of vis_u8 components.

Syntax 

vis_d64 vis_pdist(vis_d64 pixels1, vis_d64 pixels2, vis_d64
accumulator);

Description 

vis_pdist() takes three double-precision arguments pixels1, pixels2 and
accum. pixels1 and pixels2 contain 8 pixels each in raw format. The pixels
are subtracted from one another, pairwise, and the absolute values of the
differences are accumulated into accum. Note that the destination register is
a double-precision floating-point register, which contains an integral value.

To use vis_pdist() from C, it is necessary for the accumulating register
accumulator to appear both as an argument and as the receiver of the return
value.

Sun Microsystems, Inc. 
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The vis_pdist() instruction is intended to accelerate motion compensation
to support real-time video compression in such applications as H.320 video
conferencing.

Example 

vis_d64 accum, pixels1, pixels2;
accum = vis_fzero();

accum = vis_pdist(pixels1, pixels2, accum);
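
For readers without a VIS toolchain, the accumulation described above can be modeled in portable C. This is only a sketch of the arithmetic: the name pdist_model is hypothetical, the eight pixels are assumed to be packed as bytes in a 64-bit integer, and the floating-point register mechanics of the real instruction are ignored.

```c
#include <stdint.h>

/* Portable model of the pixel-distance operation: pixels1 and pixels2 each
   hold eight unsigned 8-bit components packed into one 64-bit word. The
   absolute differences of corresponding components are added to accum. */
static uint64_t pdist_model(uint64_t pixels1, uint64_t pixels2, uint64_t accum)
{
    for (int i = 0; i < 8; ++i) {
        int a = (int)((pixels1 >> (8 * i)) & 0xFF);
        int b = (int)((pixels2 >> (8 * i)) & 0xFF);
        accum += (uint64_t)(a > b ? a - b : b - a);
    }
    return accum;
}
```

As in the C example above, the accumulator is both an input and the returned value, so repeated calls over a block of pixels build up a sum of absolute differences.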

4.7.12 Block Load and Store Instructions 

Function 

Transfer 64 bytes of data between memory and registers. 
Syntax 

The Block Load and Store instructions do not have a C interface and must 
be coded in assembly language. For assembly language syntax refer to 
section 13.6.4 in the UltraSPARC-1 User's Manual. 

Description 

The block load instruction loads 64 bytes of data, with a block transfer,
from a 64-byte aligned memory area into eight double-precision floating-
point registers.

The block store instruction stores data, with a block transfer, from eight
double-precision floating-point registers to a 64-byte aligned memory area.

Example 

Note that the loop must be unrolled to achieve maximum performance. All
FP registers are double-precision. Eight versions of this loop are needed to
handle all the cases of doubleword misalignment between the source and
destination.
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accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((void *) table, byte2);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((void *) table, byte1);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((void *) table, byte0);
            accum = vis_faligndata(lookup, accum);
            ((vis_d64 *) dst)[i] = accum;
        }
        break;
    }

    /* Update pointers, remaining width. */
    src += 8*doubles;
    dst += 8*doubles;
    width -= 8*doubles;

    /* Finish up any remaining pixels. */
    for (i = 0; i < width; ++i)
        dst[i] = table[src[i]];
}

5.2.4 Alpha Blending Two Images 

This example illustrates an application where two images are blended together.
For each pair of corresponding pixels in two images "s1" and "s2", a
corresponding pixel is read from a third control image "alpha", to compute:

dst = (alpha/256) * s1 + (1 - alpha/256) * s2
    = (s1 - s2) * (alpha/256) + s2

Note that alpha can only range between 0 and 255, so strictly speaking we should
divide it by 255, not 256. However, the division by 256 occurs for free when we
perform the vis_fmul8x16 operation, and the destination will differ from the
correct result by at most 1. Whether this trade-off is acceptable or not depends
on the application.
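
As a point of comparison, the blend formula can be written as a plain scalar loop. This is a hedged reference sketch, not the VIS routine: the names blend_pixel and blend_scanline are hypothetical, it applies the formula exactly as stated with truncating integer division by 256, and the rounding of the VIS version (which works with expanded fixed-point values) may differ.

```c
#include <stdint.h>

/* dst = (alpha/256)*s1 + (1 - alpha/256)*s2, evaluated in integers.
   The weights alpha and (256 - alpha) sum to 256, so the accumulated value
   never exceeds 255*256 and the shifted result always fits in 8 bits. */
static uint8_t blend_pixel(uint8_t s1, uint8_t s2, uint8_t alpha)
{
    unsigned acc = (unsigned)alpha * s1 + (256u - alpha) * s2;
    return (uint8_t)(acc >> 8);
}

/* Blend one scan line of width pixels, one byte per pixel. */
static void blend_scanline(uint8_t *d, const uint8_t *s1, const uint8_t *s2,
                           const uint8_t *a, int width)
{
    for (int i = 0; i < width; ++i)
        d[i] = blend_pixel(s1[i], s2[i], a[i]);
}
```

With alpha = 255, s1 = 255 and s2 = 0 this yields 254 rather than 255, which illustrates the off-by-one trade-off the text describes for dividing by 256 instead of 255.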

The following illustrates the processing of one scan line: 

#define VIS_OFFSET(addr) ((addr) & 7)
#define VIS_ALIGN(addr) ((addr) & ~7)

void
alpha_blend(vis_u8 *d, vis_u8 *s1, vis_u8 *s2, vis_u8 *a,
            int width)
/*
 * Arguments
 * d = pointer to destination data
 * s1 = pointer to data for image "s1"
 * s2 = pointer to data for image "s2"
 * a = pointer to data for control image alpha
 * width = data width of s1, s2 and alpha */

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5. Advanced Topics 



{
    /* Last byte of destination. */
    vis_u8 *d_end;

    /* Doubleword-aligned pointers. */
    vis_d64 *d_aligned, *s1_aligned, *s2_aligned, *alpha_aligned;

    /* Alignment of original pointers. */
    int d_offset, s1_offset, s2_offset, off_a;

    /* Unaligned data from memory. */
    vis_d64 u_alpha_0, u_alpha_1, u_s1_0, u_s1_1, u_s2_0, u_s2_1;

    /* Properly aligned data. */
    vis_d64 quad_a, dbl_s1, dbl_s2, dbl_a, dbl_d;

    /* Temporaries. */
    vis_d64 dbl_s1_e, dbl_s2_e, dbl_tmp1, dbl_tmp2;
    vis_d64 dbl_sum1, dbl_sum2;

    /* Edge mask for partial stores. */
    unsigned int emask;

    /* Loop variables. */
    int i, times;

    vis_write_gsr(3 << 3);
    /* Four (= 7 - 3) bits of fractional precision. */

    d_end = d + width - 1;
    d_offset = VIS_OFFSET(d);
    d_aligned = (vis_d64 *) VIS_ALIGN(d);

    /* Compute initial edge mask for destination. */
    emask = vis_edge8(d, d_end);

    /* Align addresses relative to destination alignment and
       load data. */
    s1_offset = VIS_OFFSET(s1 - d_offset);
    s1_aligned = vis_alignaddr(s1, -d_offset);
    u_s1_0 = s1_aligned[0];
    u_s1_1 = s1_aligned[1];

    s2_offset = VIS_OFFSET(s2 - d_offset);
    s2_aligned = vis_alignaddr(s2, -d_offset);
    u_s2_0 = s2_aligned[0];
    u_s2_1 = s2_aligned[1];

    off_a = VIS_OFFSET(a - d_offset);
    alpha_aligned = vis_alignaddr(a, -d_offset);
    u_alpha_0 = alpha_aligned[0];
    u_alpha_1 = alpha_aligned[1];
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/* Number of times through the loop. */
    times = ((vis_u32) d_end >> 3) - ((vis_u32) d_aligned >> 3) + 1;

    for (i = 0; i < times; ++i) {
        (void) vis_alignaddr((void *) 0, off_a);
        /* Set alignment for alpha. */
        quad_a = vis_faligndata(u_alpha_0, u_alpha_1);
        u_alpha_0 = u_alpha_1;
        u_alpha_1 = alpha_aligned[i + 2];

        (void) vis_alignaddr((void *) 0, s1_offset);
        /* Set alignment for s1. */
        dbl_s1 = vis_faligndata(u_s1_0, u_s1_1);
        u_s1_0 = u_s1_1;
        u_s1_1 = s1_aligned[i + 2];

        (void) vis_alignaddr((void *) 0, s2_offset);
        /* Set alignment for s2. */
        dbl_s2 = vis_faligndata(u_s2_0, u_s2_1);
        u_s2_0 = u_s2_1;
        u_s2_1 = s2_aligned[i + 2];

        dbl_s1_e = vis_fexpand(vis_read_hi(dbl_s1));
        dbl_s2_e = vis_fexpand(vis_read_hi(dbl_s2));
        dbl_tmp2 = vis_fpsub16(dbl_s2_e, dbl_s1_e);
        dbl_tmp1 = vis_fmul8x16(vis_read_hi(quad_a), dbl_tmp2);
        dbl_sum1 = vis_fpadd16(dbl_s1_e, dbl_tmp1);

        dbl_s1_e = vis_fexpand(vis_read_lo(dbl_s1));
        dbl_s2_e = vis_fexpand(vis_read_lo(dbl_s2));
        dbl_tmp2 = vis_fpsub16(dbl_s2_e, dbl_s1_e);
        dbl_tmp1 = vis_fmul8x16(vis_read_lo(quad_a), dbl_tmp2);
        dbl_sum2 = vis_fpadd16(dbl_s1_e, dbl_tmp1);

        dbl_d = vis_freg_pair(vis_fpack16(dbl_sum1),
                              vis_fpack16(dbl_sum2));

        vis_pst_8(dbl_d, (void *) d_aligned, emask);
        ++d_aligned;

        emask = vis_edge8(d_aligned, d_end);
    }
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PERFORMING TREE ADDITIONS VIA MULTIPLICATION

CROSS REFERENCE TO RELATED APPLICATION

This application is based on provisional application Ser. No. 60/000,272,
filed Jun. 16, 1995.

BACKGROUND

The present invention concerns computer operations implemented in hardware
and particularly hardware which performs tree additions.

In computer systems one or more arithmetic logic units (ALUs) are generally
utilized to perform arithmetic operations. In addition to ALUs, high
[illegible] often include other special hardware to expedite [illegible] may
include hardware devoted to performing multiplication [illegible].

For example, a tree add operation is useful for video [illegible]. In one
case of a tree add instruction, four [illegible] in a single sixty-four bit
register are added [illegible]. In another case of a tree add instruction,
four bytes originally in a single thirty-two bit register are [illegible].

In order to perform a tree add instruction, it is generally required to place
each operand in a separate register and then to successively use the add
operation implemented by the ALU to add operands together, two at a time. As
will be understood, such an execution of a tree add instruction will
generally take a large number of instruction cycles. As long as tree
additions are rare, this is not a significant hindrance to high performance
in a computing system. However, for a computing system which frequently
performs tree additions, for example for video compression, implementing the
tree add using a large number of instruction cycles could have a negative
impact on overall system performance.

SUMMARY OF THE INVENTION

In accordance with the preferred embodiment of the present invention, a
multiplier is modified to perform a tree addition. A first value is input
into the multiplier in place of a first multiplicand. The first value is a
concatenation of addends upon which the tree addition is to be performed. A
second value is input into the multiplier in place of a second multiplicand.
Each bit of the second value is at logic zero except for a first subset of
bits. The first subset of bits includes bits of the second value, starting
with the low order bit, which are at intervals equal to a bit length of each
addend. Each of the first subset of bits is set to logic one. In partial
product rows in the multiplier which correspond to the first subset of bits,
certain partial products are forced to logic zero. This is done in such a way
that all the addends for the tree addition are aligned in columns of the
multiplier. The partial products are then summed to produce a result.

In the preferred embodiment, the partial products are generated using
three-input logic AND gates. Particular partial products are forced to zero
by placing a zero on a control input of the three-input logic AND gate used
to generate the partial product.

Also, in the preferred embodiment, the partial products are summed in two
steps. In a first step, the partial products are reduced into two rows of
partial products. A carry propagate addition is then performed on the two
rows of partial products to produce the result.

The present invention allows for an implementation of a tree addition with
only minor changes to a multiplier.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a simplified block diagram of a multiplier in accordance with
the prior art.

FIG. 2 shows a block diagram of a multiplier in accordance with the prior
art.

FIG. 3 shows a simplified block diagram of circuitry which generates partial
products for a modified multiplier in accordance with the preferred
embodiment of the present invention.

FIG. 4 shows a simplified block diagram of circuitry [illegible] generates
[illegible] used to generate partial [illegible] embodiment of the present
invention.

DESCRIPTION OF THE PRIOR ART

FIG. 1 shows a block diagram of [illegible].

FIG. 2 shows a four-bit multiplier in accordance with the prior art. The
multiplier multiplies a four-bit first multiplicand X3X2X1X0 (base 2) with a
four-bit second multiplicand Y3Y2Y1Y0 (base 2) to produce an eight-bit result
Z7Z6Z5Z4Z3Z2Z1Z0 (base 2). As is understood by those skilled in the art,
logic AND gates 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212,
213, 214, 215 and 216 may be used to generate partial products for the
multiplication. A partial product sum circuit 220 sums the partial products
generated by logic AND gates 201 through 216 to produce the result. Partial
product sum circuit includes both row reduction logic and carry propagate
addition logic, as described above.

The two multiplicands, X3X2X1X0 and Y3Y2Y1Y0, the partial products generated
by logic AND gates 201 through 216 and the result produced by partial product
sum circuit 220 may be placed in a table in such [illegible] multiplier. For
example, such a table is [illegible] Table 1 below:

TABLE 1

[Table body not legible in this copy.]

In the notation used in Table 1 above, the bit position of each bit of both
multiplicands and the result is specifically identified. Additionally, the
bits of the multiplicand which are used to form each partial product are
specifically set out. As is understood by those skilled in the art, the
information shown in Table 1 above may be set out using abbreviated or
simplified notation, as in Table 2 below:
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TABLE 2

[Table body not legible in this copy.]

In Table 2 above, each row of partial products is shown without the Y
component. Thus, the first row of partial products is listed in Table 2 as
follows:

[illegible]

However, this is a simplified notation which represents the following
partial products:

[illegible]

Similarly, the last row of partial products listed in Table 2 represents the
following partial products:

[illegible]

Using the simplified notation of Table 2, an eight-bit multiplier may be
described as shown in Table 3 below:



DESCRIPTION OF THE PREFERRED 
EMBODIMENT 

Generally, most of the circuitry and execution latency of a multiplier exists
in row reduction logic 12 and carry propagate addition logic 13. In the
preferred embodiment, no changes are made to row reduction logic 12 or carry
propagate addition logic 13 of a multiplier in order to perform a tree
addition. But in partial product generation logic 11, the two-input logic AND
gates are replaced with three-input logic AND gates.

FIG. 3 shows that in partial product generation logic 11, the two-input
logic AND gates are replaced with three-input logic AND gates. The multiplier
multiplies a four-bit first multiplicand X3X2X1X0 (base 2) with a four-bit
second multiplicand Y3Y2Y1Y0 (base 2) to produce an eight-bit result
Z7Z6Z5Z4Z3Z2Z1Z0 (base 2). As is understood by those skilled in the art,
logic AND gates 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,
312, 313, 314 and 315 may be used to generate partial products for the
multiplication. When generating partial products for multiplication, control
inputs B0, B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14 and
B15 are all set to logic 1 in order to generate partial products P0, P1, P2,
P3, P4, P5, P6, P7, P8, P9, P10, P11, P12, P13, P14 and P15.

In addition, selection of multiplicands and selection of values of control
inputs B0, B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14 and
B15 can be used to perform a tree add as is further described below.
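
The role of the control inputs can be illustrated with a small software model. This is a sketch under stated assumptions, not the circuit of FIG. 3: the function name gated_multiply and the per-row control masks are hypothetical, and the row reduction and carry propagate stages are collapsed into ordinary integer addition.

```c
#include <stdint.h>

/* Model of an n-bit multiplier (n <= 16) whose partial products come from
   three-input AND gates: P = X[i] AND Y[row] AND B[row][i].
   control[row] holds the control bits for that row, bit i for X bit i. */
static uint32_t gated_multiply(uint16_t x, uint16_t y,
                               const uint16_t control[], unsigned n)
{
    uint32_t sum = 0;
    for (unsigned row = 0; row < n; ++row) {
        uint32_t y_bit = (y >> row) & 1u;
        uint32_t row_bits = y_bit ? (uint32_t)(x & control[row]) : 0u;
        sum += row_bits << row;   /* row r carries weight 2^r */
    }
    return sum;
}
```

With every control bit set to logic 1 the model reduces to an ordinary multiplication; clearing selected control bits removes the corresponding partial products from the sum.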



TABLE 3

[Table body not legible in this copy.]



The multiplier shown in Table 3 multiplies an eight-bit first multiplicand
X7X6X5X4X3X2X1X0 (base 2) with an eight-bit second multiplicand
Y7Y6Y5Y4Y3Y2Y1Y0 (base 2) to produce a sixteen-bit result
Z15Z14Z13Z12Z11Z10Z9Z8Z7Z6Z5Z4Z3Z2Z1Z0 (base 2). To further simplify
notation, the partial products and the sixteen-bit result may be written
without subscripts. Thus, the eight-bit multiplication may be represented as
in Table 4 below:



TABLE 4

[Table body not legible in this copy.]

Particularly, when a tree add is to be performed on a plurality of addends
within a first register, a first value in the first register is input in
place of the first multiplicand for the multiplier. The first value is a
concatenation of the addends. A second value is input into the multiplier in
place of the second multiplicand. Each bit of the second value is at logic
zero except for a first subset of bits. The first subset of bits includes a
low order bit of the second value, and includes bits of the second value
which, starting from the low order bit, are at intervals which are equal to a
bit length of each addend. Each of the first subset of bits is set to logic
one.

For partial product rows in the multiplier which correspond to the first
subset of bits, a portion of partial products in the partial product rows are
forced to logic zero, so that addends for the tree addition are aligned in
columns of the multiplier. The row reduction logic 12 and carry propagate
addition 13 generate a result for the tree add, which is shifted to the left
a number of bits equal to the bit length used by all addends less one.

FIG. 4 shows control input generation 400 which generates control inputs B0,
B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14 and B15 for a
tree add. Control input generation 400 generates control inputs B0, B1, B2,
B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14 and B15, for example,
during register read-out time so that generation of the control inputs does
not delay operations performed by the multiplier. In addition, control input
generation may be used to generate control inputs B0, B1, B2, B3, B4, B5, B6,
B7, B8, B9, B10, B11, B12, B13, B14 and B15 for other operations such as a
population count.

For example, in order to perform a tree add on two four-bit words, "abcd"
and "efgh", using a modified eight-bit multiplier, the value "abcdefgh" is
used in place of the first multiplicand X7X6X5X4X3X2X1X0 (base 2). The value
00010001 is used in place of an eight-bit second multiplicand
Y7Y6Y5Y4Y3Y2Y1Y0 (base 2). For the top row, the control inputs for the four
least significant bit positions are forced to logic zero. For the fifth row
down, the control inputs for the four most significant bit positions are
forced to logic zero. The row reduction logic 12 and carry propagate addition
13 generate a result for the tree add, which is shifted to the left four
bits, the bit length of all addends (eight bits) less the bit length of one
addend (four bits).
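
The worked example above can be checked with a short software model. The sketch below is an illustration under assumptions rather than the patented circuit: tree_add_raw and tree_add are hypothetical names, the multiplier width n is limited to 16 bits, n is assumed to be a multiple of the addend width w, and the summing stages are modeled as plain integer addition.

```c
#include <stdint.h>

/* x is the concatenation of n/w addends, each w bits wide. The second
   multiplicand has a one at rows 0, w, 2w, ...; every other row contributes
   nothing. In an active row, the control inputs keep only the addend that
   lands in columns n-w .. n-1 once the row is shifted left by its index. */
static uint32_t tree_add_raw(uint16_t x, unsigned n, unsigned w)
{
    uint32_t field = (1u << w) - 1u;   /* w consecutive ones */
    uint32_t sum = 0;
    for (unsigned row = 0; row < n; row += w) {
        uint32_t control = field << (n - w - row);
        sum += ((uint32_t)x & control) << row;
    }
    return sum;   /* sum of the addends, shifted left by n - w bits */
}

/* Undo the shift described in the text to recover the plain sum. */
static uint32_t tree_add(uint16_t x, unsigned n, unsigned w)
{
    return tree_add_raw(x, n, w) >> (n - w);
}
```

For the two four-bit words 1011 and 0110, tree_add_raw(0xB6, 8, 4) gives 17 shifted left four bits and tree_add(0xB6, 8, 4) gives 17. The same model covers four four-bit words or two eight-bit words in a sixteen-bit multiplier by changing n and w.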

Table 5 below illustrates the use of a modified multiplier to perform this
tree addition:

TABLE 5

[Table body not legible in this copy.]

In Table 5 above, each [illegible] indicates a value forced to logic zero by
a control input. The result of the tree add is the value "ZZZZZ". In the
result register, this value is shifted four bits to the left of the least
significant bit.

In order to perform a tree add on four four-bit words, "abcd," "efgh,"
"ijkl," and "mnop," using a modified sixteen bit multiplier, the value
"abcdefghijklmnop" is used in place of the first multiplicand. The value
0001000100010001 is used in place of a second multiplicand. For the top row,
the control inputs for the twelve least significant bit positions are forced
to logic zero. For the fifth row down, the control inputs for the eight least
significant bit positions and the four most significant bit positions are
forced to logic zero. For the ninth row down, the control inputs for the four
least significant bit positions and the eight most significant bit positions
are forced to logic zero. For the thirteenth row down, the control inputs for
the twelve most significant bit positions are forced to logic zero. The row
reduction logic 12 and carry propagate addition 13 generate a result for the
tree add, which is shifted to the left twelve bits, the bit length of all
addends (sixteen bits) less the bit length of one addend (four bits).

Table 6 below illustrates the use of a modified multiplier to perform this
tree addition:

TABLE 6

[Table body not legible in this copy.]

In Table 6 above, each [illegible] indicates a value forced to logic zero by
a control input. The result of the tree add is the value "ZZZZZZ". In the
result register, this value is shifted twelve bits to the left of the least
significant bit.

In order to perform a tree add on two eight-bit words, "abcdefgh" and
"ijklmnop," using a modified sixteen bit multiplier, the value
"abcdefghijklmnop" is used in place of the first multiplicand. The value
0000000100000001 is used in place of a second multiplicand. For the top row,
the control inputs for the eight least significant bit positions are forced
to logic zero. For the ninth row down, the control inputs for the eight most
significant bit positions are forced to logic zero. The row reduction logic
12 and carry propagate addition 13 generate a result for the tree add, which
is shifted to the left eight bits, the bit length of all addends (sixteen
bits) less the bit length of one addend (eight bits).

Table 7 below illustrates the use of a modified multiplier to perform this
tree addition:
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TABLE 7

[Table body not legible in this copy.]

In Table 7 above, each [illegible] indicates a value forced to logic zero by
a control input. The result of the tree add is the value "ZZZZZZZZZ". In the
result register, this value is shifted eight bits to the left of the least
significant bit.

As will be understood by persons of ordinary skill in the art, as a
variation on the present invention, any partial product with a value of logic
zero may be generated using the control input for the partial product. For
example, in order to perform a tree add on two eight-bit words, "abcdefgh"
and "ijklmnop," using a modified sixteen bit multiplier, the value
1111111111111111 is used in place of a second multiplicand. For all rows
except the top row and the ninth row, the control inputs are forced to logic
zero. For the top row, the control inputs for the eight least significant bit
positions are forced to logic zero. For the ninth row down, the control
inputs for the eight most significant bit positions are forced to logic zero.
The row reduction logic 12 and carry propagate addition 13 generate a result
for the tree add, which is shifted to the left eight bits, the bit length of
all addends (sixteen bits) less the bit length of one addend (eight bits).

Table 8 below illustrates the use of a modified multiplier to perform this
tree addition:



TABLE 8

[Table body not legible in this copy.]

In Table 8 above, each [illegible] indicates a value forced to logic zero by
a control input. The result of the tree add is the value "ZZZZZZZZZ". In the
result register, this value is shifted eight bits to the left of the least
significant bit.

The foregoing discussion discloses and describes merely exemplary methods
and embodiments of the present invention. As will be understood by those
familiar with the art, the invention may be embodied in other specific forms
without departing from the spirit or essential characteristics thereof. For
example, the partial products might be generated with logic equivalents of a
three-input logic AND gate or with a 2-to-1 multiplexor. Accordingly, the
disclosure of the present invention is intended to be illustrative, but not
limiting, of the scope of the invention, which is set forth in the following
claims.

I claim:

1. A multiplier which also performs tree addition comprising:

partial product generation means for generating partial products for
multiplication, the partial product generation means including zeroing means
for forcing a subset of the partial products to zero when performing a tree
addition; and,

partial product sum means, coupled to the partial product generation means,
for summing the partial products generated by the partial product generation
means to produce a result.

2. A multiplier as in claim 1 wherein the partial product means comprises a
plurality of three-input logic AND gates arranged in rows, each row of logic
AND gates used to multiply all bits of a first multiplicand by a single bit
of a second multiplicand during multiplication.

3. A multiplier as in claim 2 wherein the zeroing means comprises a control
input to each of the three-input logic AND gates.

4. A multiplier as in claim 1 wherein the partial product sum means
comprises:

row reduction logic, the row reduction logic reducing the partial products
generated by the partial product generation means into two rows of partial
products; and,

logic which performs a functional equivalent of a carry propagate addition
on the two rows of partial products to produce the result.

5. A multiplier as in claim 1 wherein when a tree add is to be performed on
a plurality of addends:

a first value is input into the multiplier in place of a first multiplicand,
the first value being a concatenation of the addends;

a second value is input into the multiplier in place of a second
multiplicand, each bit of the second value being at logic zero except for a
first subset of bits comprising bits of the second value which, starting from
the low order bit, are at intervals which are equal to a bit length of each
addend, each of the first subset of bits being set to logic one; and,

for partial product rows in the multiplier which correspond to the first
subset of bits, a portion of partial products in the partial product rows are
forced to logic zero, so that addends for the tree addition are aligned in
columns of the multiplier.

6. A multiplier as in claim 1 wherein when a tree add is to be performed on
a plurality of addends:

a first value is input into the multiplier in place of a first multiplicand,
the first value including the addends;

a second value is input into the multiplier in place of a second
multiplicand, each bit of the second value being at logic zero except for a
first subset of bits comprising bits of the second value which, starting from
the low order bit, are at intervals which are equal to a bit length of each
addend, each of the first subset of bits being set to logic one; and,

for partial product rows in the multiplier which correspond to the first
subset of bits, a portion of partial
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products in the partial product rows are forced to logic zero, so that
addends for the tree addition are aligned in columns of the multiplier.

7. A multiplier as in claim 1 wherein when a tree add is to be performed on
a plurality of addends:

a first value is input into the multiplier in place of a first multiplicand,
the first value being a concatenation of the addends;

a second value is input into the multiplier in place of a second
multiplicand, each bit of the second value being at logic one; and,

a portion of the partial products in the multiplier are forced to logic
zero, so that addends for the tree addition are aligned in columns of the
multiplier.

8. A method for using a multiplier to perform a tree addition comprising the
steps of:

(a) inputting a first value to the multiplier in place of a first
multiplicand, the first value being a concatenation of addends upon which the
tree addition is performed;

(b) inputting a second value to the multiplier in place of a second
multiplicand, each bit of the second value being at logic zero except for a
first subset of bits comprising bits of the second value which, starting with
the low order bit, are at intervals equal to a bit length of each addend,
each of the first subset of bits being set to logic one;

(c) for partial product rows in the multiplier which correspond to the first
subset of bits, forcing to logic zero a portion of partial products in the
partial product rows, so that addends for the tree addition are aligned in
columns of the multiplier; and,

(d) summing the partial products to produce a result.

9. A method as in claim 8 wherein in step (c) partial products are forced to
zero by placing a zero on a control input of three-input logic AND gate used
to generate the partial product.

10. A method as in claim 8 wherein step (d) includes the following substeps:

(d.1) reducing the partial products into two rows of partial products; and,

(d.2) performing a functional equivalent of a carry propagate addition on
the two rows of partial products to produce the result.

11. A method for using a multiplier to perform a tree addition comprising
the steps of:

(a) inputting a first value to the multiplier in place of a first
multiplicand, the first value including addends upon which the tree addition
is performed;

(b) forcing a subset of the partial products to zero when performing a tree
addition; and,

(c) summing the partial [illegible].

12. A method as in claim 11 wherein in step (b) partial products are forced
[illegible] partial product.

13. A method as in claim 11 wherein [illegible] following substeps:

(c.1) reducing the partial [illegible] products; and,

(c.2) performing a functional equivalent of a carry propagate addition on
the two rows of partial products to produce the result.

14. A method as in claim 11 wherein in step (a) the first value is a
concatenation of the addends.

15. A method as in claim 11 additionally including the following step:

inputting a second value to the multiplier in place of a second
multiplicand, each bit of the second value being at logic one.

16. A method as in claim 15 wherein in step (b) partial products are forced
to zero by placing a zero on a control input of three-input logic AND gate
used to generate the partial product.
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ABSTRACT 



The integer execution unit (IEU) of a central processing unit (CPU) is
provided with a graphics status register (GSR) for storing a graphics data
scaling factor and a graphics data alignment address offset. Additionally,
the CPU is provided with a graphics execution unit (GRU) for executing a
number of graphics operations in accordance to the graphics data scaling
factor and alignment address offset, the graphics data having a number of
graphics data formats. In one embodiment, the GRU is also used to execute a
number of graphics data addition, subtraction, rounding, expansion, merge,
alignment, multiplication, logical, compare, and pixel distance operations.
The graphics data operations are categorized into a first and a second
category, and the GRU concurrently executes one graphics operation from each
category. Furthermore, under this embodiment, the IEU is also used to execute
a number of graphics data edge handling and 3-D array addressing operations,
while the load and store unit (LSU) of the CPU is also used to execute a
number of graphics data load and store operations, including conditional
store operations.

9 Claims, 20 Drawing Sheets 
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Figure 6a 
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
       opcode                opf         operation

98a    ALIGNADDRESS          000011000   calculate address for misaligned
                                         data access
98b    ALIGNADDRESS_LITTLE   000011010   calculate address for misaligned
                                         data access, little endian
100    FALIGNDATA            001001000   performs data alignment for
                                         misaligned data

Exemplary Assembly Language Syntax

alignaddr    reg_rs1, reg_rs2, reg_rd
alignaddrl   reg_rs1, reg_rs2, reg_rd
faligndata   freg_rs1, freg_rs2, freg_rd



Figure 7a 
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0001 1100 1 


4 16-bit add 


FPACK32 


000111010 


2 32-bit add 


FFACKFBC 
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4 16-hit subtract 



Exemplary Assembly Language Syntax 



fpack16    freg_rs2, freg_rd
fpack32    freg_rs1, freg_rs2, freg_rd
fpackfix   freg_rs2, freg_rd



Figure 8a 
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op = 10, op3 = 110110

       opcode   opf         operation

138    PDIST    000111110   distance between 8 8-bit components

Exemplary Assembly Language Syntax

pdist   freg_rs1, freg_rs2, freg_rd



Figure 9a 
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op = 10, op3 = 110110

        opcode    opf         operation

140a    EDGE8     000000000   8 8-bit edge boundary processing
140b    EDGE8L    000000010   8 8-bit edge boundary processing, little endian
140c    EDGE16    000000100   4 16-bit edge boundary processing
140d    EDGE16L   000000110   4 16-bit edge boundary processing, little endian
140e    EDGE32    000001000   2 32-bit edge boundary processing
140f    EDGE32L   000001010   2 32-bit edge boundary processing, little endian

Exemplary Assembly Language Syntax

edge8     reg_rs1, reg_rs2, reg_rd
edge8l    reg_rs1, reg_rs2, reg_rd
edge16    reg_rs1, reg_rs2, reg_rd
edge16l   reg_rs1, reg_rs2, reg_rd
edge32    reg_rs1, reg_rs2, reg_rd
edge32l   reg_rs1, reg_rs2, reg_rd



Figure 10a 
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BIG ENDIAN 



Edge Size    LSB    Left Edge    Right Edge
8            000    11111111     10000000
8            001    01111111     11000000
8            010    00111111     11100000
8            011    00011111     11110000
8            100    00001111     11111000
8            101    00000111     11111100
8            110    00000011     11111110
8            111    00000001     11111111
16           00x    1111         1000
16           01x    0111         1100
16           10x    0011         1110
16           11x    0001         1111
32           0xx    11           10
32           1xx    01           11

LITTLE ENDIAN

Edge Size    LSB    Left Edge    Right Edge
8            000    11111111     00000001
8            001    11111110     00000011
8            010    11111100     00000111
8            011    11111000     00001111
8            100    11110000     00011111
8            101    11100000     00111111
8            110    11000000     01111111
8            111    10000000     11111111
16           00x    1111         0001
16           01x    1110         0011
16           10x    1100         0111
16           11x    1000         1111
32           0xx    11           01
32           1xx    10           11



Figure 10b 
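
The mask patterns tabulated in FIG. 10b follow a regular rule, which the short Python sketch below tries to capture. It is inferred only from the table itself; the function name `edge_masks` and its parameters are illustrative assumptions and do not appear in the patent.

```python
def edge_masks(edge_size, lsb, little_endian=False):
    """Return (left_edge, right_edge) mask strings for one table row.

    edge_size: element width in bits (8, 16 or 32).
    lsb: the three least significant address bits (0..7).
    Assumption: the rule below is inferred from the FIG. 10b table.
    """
    if edge_size not in (8, 16, 32):
        raise ValueError("edge_size must be 8, 16 or 32")
    lanes = 64 // edge_size                      # 8, 4 or 2 mask bits
    # Only the upper LSB bits matter for wider elements (00x, 0xx, ...).
    index = (lsb & 0b111) >> {8: 0, 16: 1, 32: 2}[edge_size]
    full = (1 << lanes) - 1
    if little_endian:
        left = (full << index) & full            # clear the low 'index' bits
        right = (1 << (index + 1)) - 1           # set the low 'index + 1' bits
    else:
        left = full >> index                     # clear the high 'index' bits
        right = (full << (lanes - 1 - index)) & full  # set the high 'index + 1' bits
    return format(left, f"0{lanes}b"), format(right, f"0{lanes}b")


if __name__ == "__main__":
    for size in (8, 16, 32):
        for lsb in range(8):
            print(size, format(lsb, "03b"), edge_masks(size, lsb),
                  edge_masks(size, lsb, little_endian=True))
```

Running the script prints every row of both tables, which makes it easy to compare against the figure.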
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op = 


10, op3 = 110110

        opcode     opf          operation
142a    ARRAY8     000010010    convert 8-bit 3-D address to blocked byte address
142b    ARRAY16    000010010    convert 16-bit 3-D address to blocked byte address
142c    ARRAY32    000010010    convert 32-bit 3-D address to blocked byte address

rs2 value    number of elements
0            64
1            128
2            256
3            512
4            1024
5            2048

rs1 (bit positions):
z integer [63:55] | z fraction [54:44] | y integer [43:33] | y fraction [32:22] | x integer [21:11] | x fraction [10:0]

Blocked byte address (fields from most to least significant, followed by
the bit positions marked at the field boundaries):

8 bits:     upper Z Y X | middle Z Y X | lower Z Y X
            20+2rs2, 17+2rs2, 17+rs2, 17, 13, 9, 5, 4, 2, 0

16 bits:    upper Z Y X | middle Z Y X | lower Z Y X | 0
            21+2rs2, 18+2rs2, 18+rs2, 18, 14, 10, 6, 5, 3, 1, 0

32 bits:    upper Z Y X | middle Z Y X | lower Z Y X | 00
            22+2rs2, 19+2rs2, 19+rs2, 19, 15, 11, 7, 6, 4, 2, 0

Exemplary Assembly Language Syntax

array8     reg_rs1, reg_rs2, reg_rd
array16    reg_rs1, reg_rs2, reg_rd
array32    reg_rs1, reg_rs2, reg_rd

Figure 11a
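
The field layout in FIG. 11a can be read as a bit-shuffling recipe. The Python sketch below is one possible reading, not the patent's circuit. It assumes that the lowest integer bits of each coordinate feed the "lower" fields, the next four bits feed the "middle" fields, and the remaining bits feed the "upper" fields, and that the 16-bit and 32-bit variants are the 8-bit result shifted left by one and two bits. The helper `pack_rs1` is hypothetical and exists only to build test inputs.

```python
def pack_rs1(x, y, z):
    """Hypothetical helper: place integer coordinates in the rs1 layout
    (x integer at bits 21:11, y integer at 43:33, z integer at 63:55)."""
    return ((z & 0x1FF) << 55) | ((y & 0x7FF) << 33) | ((x & 0x7FF) << 11)


def array8(rs1, rs2):
    """Sketch of the 8-bit blocked byte address conversion.

    Assumption: rs2 (0..5) selects how many upper X and Y bits are used,
    consistent with the 64..2048 element counts in the figure.
    """
    n = rs2
    if not 0 <= n <= 5:
        raise ValueError("rs2 must be in 0..5")
    x = (rs1 >> 11) & 0x7FF
    y = (rs1 >> 33) & 0x7FF
    z = (rs1 >> 55) & 0x1FF
    upper_mask = (1 << n) - 1
    addr = x & 0x3                                   # lower X, bits 1:0
    addr |= (y & 0x3) << 2                           # lower Y, bits 3:2
    addr |= (z & 0x1) << 4                           # lower Z, bit 4
    addr |= ((x >> 2) & 0xF) << 5                    # middle X, bits 8:5
    addr |= ((y >> 2) & 0xF) << 9                    # middle Y, bits 12:9
    addr |= ((z >> 1) & 0xF) << 13                   # middle Z, bits 16:13
    addr |= ((x >> 6) & upper_mask) << 17            # upper X, from bit 17
    addr |= ((y >> 6) & upper_mask) << (17 + n)      # upper Y, from bit 17+rs2
    addr |= ((z >> 5) & 0xF) << (17 + 2 * n)         # upper Z, from bit 17+2rs2
    return addr


def array16(rs1, rs2):
    return array8(rs1, rs2) << 1   # low bit is always 0


def array32(rs1, rs2):
    return array8(rs1, rs2) << 2   # low two bits are always 00
```

With this reading, neighbouring elements in X, Y and Z share the low address bits, so a small 3-D neighbourhood falls into one contiguous block of memory.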
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CO 
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CD 
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1 _ 2 

CENTRAL PROCESSING UNIT WITH graphics funcuoos known to have been integrated include 

INTEGRATED GRAPHICS FUNCTIONS only frame butter checks, add wilb pixel merge, and add 

t wua 2-buffer merge. Much of the graphics processing on 

This is a Continuation of application Scr. No. 05/236, these modem prior an systems remain being processed by 

572* filed Apr. 29, 1994 now abandoned. a ihe general purpose CPU without additional built-in graph- 

BACKGROUND OF THE INVENTION " by ^ ""^ P— ■ 

1. Field of the Invention As win be disclosed, the present invention provides a cost 
The presto invention relates to the field of computer , n cff "5 ve > P^raance CPU with intefirated nadvc 

systems. More specifically, the present invention relates to a f^™* ^ •* r "^^Y overcomes much 

cost effiicu^, high performance central processing unu ^ P**™"** garners ^nd achieves ihe above 

(CPU) havina imegraled graphics capabttties. described and other desired result 

2. Background SUMMARY OF THE INVENTION 
J^l^^ S tC hjgh perfor- * under the present invention, ibe desired results are advan- 

lypicaliy perfonn large amount of figure manipulation . ™«i,s« - . r° 7™»v j , 

operatfansstichas iranifonnatioDS and clings nsiLfl™ 2^ c ^ 1 ""L" l0Wt 0,1C 

ing point dau. IT* second barrior b in iZWcr fix J poinl *» KSl^S^ "? ""J? T* l0 ,. StQre a 

prolog through^,. Graphics applications STo^, tT^* dal \. ak S n '" eot 

perfoiml^c an ^nt of aerations sud.^ addn^ofifeetHic atleasi oce partitioned exBcwbon palh is 

coovetsio^d color tate^ktiS7^„ fi^d !T » » number of graphics ope, ations on graphics 

point data. n» third barricTk in n^"^S ^ h J^ a numbe, of graphics data fo^ Some of 

5££ S^iSSS/"" ^ ^ ^ ^ ^^.opisopo JHSSA 

v _. . „ , ____ . graphics data scaling factor and alignment address offset. 

J^^KS ^ r St1 i r 5mpU " !r I« ^ embodiment, the GRU comprises a Cn.1 and a 

££S.hTSES£? r™^, Wh ^ t> foncd addilJ ° 0 - d subtracdoi, esparaoo, mage, X 

c i . 35 ^rations on graphics data using the alignment address 

some later pnor an computer systems provide auxiliary * onset. The second partitioned execution path is used to 

duplay -processors- The auxiliary display processor would independently execute a number of partitioned 

off load these later CPUs from some of the display related multiplication, a number of pixel distance compulation, and 

operations. Howcvci, these later CPUs would still be respon- compare operations on graphics data, and a number of data 

sMa for most, of the graphics processing, lypicaliy, the „ packing operations on graphics data using the scaling factor. 

tSS^tS^^c^f T T Pli0t ™^ this embodiment, the integer execu- 

^epmcessotsover .buses. The auxiliary display proccs- ^S^^S SSHjta 

^^^^IVeZZT^^ « fc^s^wMJethnloadand^umt^U) 

While wS Si ho P ZeTtee * " b ° VSCd -'° <Xe T ? M 40(1 

approach is costly aud comptex ^ operations, mcludtng partial coadmooal store opera- 

Other later prior art computer systems would provide auxiliary
graphics processors with even richer graphics functions. The
auxiliary graphics processors would off load the CPUs of these later
prior art computer systems from most of the graphics processing.
Under this approach, extensive dedicated hardware as well as
sophisticated software interface between the CPUs and the auxiliary
graphics processors will have to be provided. While performance will
increase even more, however, the approach is even more costly and
more complex than the display processor approach.

In the case of microprocessors, as the technology continues to allow
more and more circuitry to be packaged in a small area, it is
increasingly more desirable to integrate the general purpose CPU
with built-in graphics capabilities instead. Some modern prior art
computer systems have begun to do that. However, the amount and
nature of graphics functions integrated in these modern prior art
computer systems typically are still very limited. Particular

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the CPU of an exemplary graphics computer system
incorporating the teachings of the present invention.

FIG. 2 illustrates the relevant portions of one embodiment of the
Graphics Execution Unit (GRU) in further detail.

FIG. 3 illustrates the Graphics Status Register (GSR) of the GRU in
further detail.

FIG. 4 illustrates the first partitioned execution path of the GRU in
further detail.

FIG. 5 illustrates the second partitioned execution path of the GRU
in further detail.

FIGS. 6a-6c illustrate the graphics data formats, the graphics
instruction formats, and the graphic instruction groups in further
detail.

FIGS. 7a-7c illustrate the graphics data alignment instructions and
circuitry in further detail.

FIGS. 8a-8g illustrate the graphic data packing instructions and
circuitry in further detail.

FIGS. 9a-9b illustrate the graphics data pixel distance computation
instruction and circuitry in further detail.

FIGS. 10a-10b illustrate the graphics data edge handling instructions
in further detail.

FIGS. 11a-11b illustrate the graphics data 3-D array addressing
instructions and circuitry in further detail.

DETAILED DESCRIPTION

In the following description, for purposes of explanation, specific
numbers, materials and configurations are set forth in order to
provide a thorough understanding of the present invention. However,
it will be apparent to one skilled in the art that the present
invention may be practiced without the specific details. In other
instances, well known systems are shown in diagrammatic or block
diagram form in order not to obscure the present invention.

Referring now to FIG. 1, a block diagram illustrating the CPU of an
exemplary graphics computer system incorporating the teachings of the
present invention is shown. As illustrated, the CPU 24 comprises a
prefetch and dispatch unit (PDU) 46 including an instruction cache
40, an integer execution unit (IEU) 30, an integer register file 36,
a floating point unit (FPU) 26, a floating point register file 38,
and a graphics execution unit (GRU) 28, coupled to each other as
shown. Additionally, the CPU 24 comprises two memory management units
(IMMU & DMMU) 44a-44b, and a load and store unit (LSU) 48 including a
data cache 42, coupled to each other and the previously described
elements as shown. Together they fetch, dispatch, execute, and save
execution results of instructions, including graphics instructions,
in a pipelined manner.

The PDU 46 fetches instructions from memory and dispatches them to
the IEU 30, the FPU 26, the GRU 28 and the LSU 48 accordingly.
Prefetched instructions are stored in the instruction cache 40. The
IEU 30, the FPU 26, and the GRU 28 perform integer, floating point,
and graphics operations respectively. In general, the integer
operation results are stored in the integer register file 36, whereas
the floating point and graphics operation results are stored in the
floating point register file 38. Additionally, the IEU 30 also
performs a number of graphics operations, and appends address space
identifiers (ASI) to addresses of load/store instructions for the LSU
48, identifying the address spaces being accessed. The LSU 48
generates addresses for all load and store operations. The LSU 48
also supports a number of load and store operations, specifically
designed for graphics data. Memory references are made in virtual
addresses. The MMUs 44a-44b map virtual addresses to physical
addresses.

There are many variations to how the PDU 46, the IEU 30, the FPU 26,
the integer and floating point register files 36 and 38, the MMUs
44a-44b, and the LSU 48, are coupled to each other. In some
variations, some of these elements 46, 30, 26, 36, 38, 44a-44b, and
48, may be combined, while in other variations, some of these
elements 46, 30, 26, 36, 38, 44a-44b, and 48, may perform other
functions. Thus, except for the incorporated teachings of the present
invention, these elements 46, 30, 26, 36, 38, 44a-44b, and 48, are
intended to represent a broad category of PDUs, IEUs, FPUs, integer
and floating point register files, MMUs, and LSUs, found in many
graphics and non-graphics CPUs. Their constitutions and functions are
well known and will not be otherwise described further. The teachings
of the present invention incorporated in these elements 46, 30, 26,
36, 38, 44a-44b, and 48, and the GRU 28 will be described in further
detail below.

Referring now to FIG. 2, a block diagram illustrating the relevant
portions of one embodiment of the GRU in further detail is shown. In
this embodiment, the GRU 28 comprises a graphics status register
(GSR) 50, a first and a second partitioned execution path 32 and 34.
The two execution paths 32 and 34 are independent of each other. In
other words, two graphics instructions can be independently issued
into the two execution paths 32 and 34 at the same time. Together,
they independently execute the graphics instructions, operating on
graphics data. The functions and constitutions of these elements 50,
32 and 34 will be described in further detail below with additional
references to the remaining figures.

Referring now to FIG. 3, a diagram illustrating the relevant portions
of one embodiment of the graphics status register (GSR) is shown. In
this embodiment, the GSR 50 is used to store the least significant
three bits of a pixel address before alignment (alignaddr_offset) 54,
and a scaling factor to be used for pixel formatting (scale_factor)
52. The alignaddr_offset 54 is stored in bits GSR[2:0], and the
scale_factor 52 is stored in bits GSR[6:3]. As will be described in
more detail below, two special instructions RDASR and WRASR are
provided for reading from and writing into the GSR 50. The RDASR and
WRASR instructions, and the usage of alignaddr_offset 54 and
scale_factor 52 will be described in further detail below.

Referring now to FIG. 4, a block diagram illustrating the relevant
portions of one embodiment of the first partitioned execution path is
shown. The first partitioned execution path 32 comprises a
partitioned carry adder 37, a graphics data alignment circuit 39, a
graphics data expand/merge circuit 60, and a graphics data logical
operation circuitry 62, coupled to each other as shown. Additionally,
the first partitioned execution path 32 further comprises a couple of
registers 35a-35b, and a 4:1 multiplexor 43, coupled to each other
and previously described elements as shown. At each dispatch, the PDU
46 may dispatch either a graphics data partitioned add/subtract
instruction, a graphics data alignment instruction, a graphics data
expand/merge instruction, or a graphics data logical operation to the
first partitioned execution path 32. The partitioned carry adder 37
executes the partitioned graphics data add/subtract instructions, and
the graphics data alignment circuit 39 executes the graphics data
alignment instruction using the alignaddr_offset stored in the GSR
50. The graphics data expand/merge circuit 60 executes the graphics
data merge/expand instructions. The graphics data logical operation
circuit 62 executes the graphics data logical operations.

The functions and constitutions of the partitioned carry adder 37 are
similar to simple carry adders found in many integer execution units
known in the art, except the hardware are replicated multiple times
to allow multiple additions/subtractions to be performed
simultaneously on different partitioned portions of the operands.
Additionally, the carry chain can be broken into two 16-bit chains.
Thus, the partitioned carry adder 37 will not be further described.

Similarly, the functions and constitutions of the graphics data
expand/merge circuit 60, and the graphics data logical operation
circuit 62 are similar to expand/merge and logical operation circuits
found in many integer execution units known in the art, except the
hardware are replicated multiple times to allow multiple expand/merge
and logical operations to be performed simultaneously on different
partitioned portions of the operands. Thus, the graphics data
expand/merge circuit 60, and the graphics data logical operation
circuit 62 will also not be further described.

The graphics data partitioned add/subtract and the graphics data
alignment instructions, and the graphics data alignment circuit 39
will be described in further detail below.
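
The idea of a carry adder whose carry chain is broken at partition boundaries can be illustrated in software. The Python sketch below is not the patent's hardware. It uses a common masking technique to add independent lanes inside one wide word, and it assumes that each lane simply wraps around on overflow, which the text does not state. The function name and parameters are hypothetical.

```python
def partitioned_add(a, b, lane_bits=16, width=64):
    """Add two words lane by lane so that no carry crosses a lane boundary.

    Assumptions: lanes wrap on overflow; lane_bits divides width.
    """
    lanes = width // lane_bits
    high = 0
    for i in range(lanes):
        high |= 1 << (i * lane_bits + lane_bits - 1)   # top bit of every lane
    low = ((1 << width) - 1) & ~high                   # all other bits
    # Add the low bits of each lane (a carry can only reach the lane's own
    # top bit), then fold in the top bits with XOR so nothing propagates out.
    return ((a & low) + (b & low)) ^ ((a ^ b) & high)


if __name__ == "__main__":
    a = 0x0001_FFFF_0002_8000
    b = 0x0001_0001_0003_8000
    print(hex(partitioned_add(a, b)))            # four 16-bit lanes
    print(hex(partitioned_add(a, b, lane_bits=32)))  # two 32-bit lanes
```

Changing `lane_bits` plays the role of choosing where the carry chain is broken: the same adder logic serves 16-bit and 32-bit partitions.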
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Referring now to FJG- 5, a block diagram ilhisi rating the precision and dynamic range for storing intermediate data 

reWanl portion Of ooc crjihndimcni of the second parti- computed during filtering and other simple image manipu- 

tioncd execution paiii in further detail is shown. Id this lation operations performed on pixel data. Graphics data 

embodiment, the second partitioned execution path 34 com- * format conversions are performed using, the graphics data 

prises a pixel distance computation circuit 36, a partitioned 5 pack, expand, merge, and multiply instructions described 

multiplier 58, a graphics data packing circuit 59, and a below. 

34f^ercompri^ an un,berof rogis tc re 55 a -SSc,a P 4:l tTT^, ^ Regardless of the 

multiplexor 53, coupled to each olbeTand die previously lfl 1?™?™*°%? ^^^i.* 6 ^ <*» 

described elememsw shown. At eact. dispatch, the WW 4 10 f^^^^S ir*lruclioo formal 

_.„j:„.i„u -.-wi a-.,,^^^^..,;^!,',.™^^ identification, and bit<24:19] 74d-74c provide the seeond- 

maydisnalcb i e therap^l<teiaooscompmuoaiflsaucnon, formatidenuricaiioc forme nranhics instruc- 

a graphic* data partitioned multiplication instruction, a "jr «»""»»«> i?. 1,7* L 5, ^~T^ ^ 

gravies data pscWog instruction, or a graphics data com- ^ ^'T^' ha *° i ^ W> "^^^J t 

pare insiruciion W die second pardoned exicuiioo path 34. desr^atwn (third souro) re^^ 

The pbeel distance compulation circuit 56 executes tbe pixel V ^ F^T* ^Vl*" 1 ^ 

distance compulation instruction. The paninoned multiplier ? 6 *- 7 «f "dentil the first source register of the psphrc 

io ,,,, _„l,-„ j... _.-j5„„^i „„m„ij„ oh -™ mstrncdon. For the first graphics instruct too format 68a, 

ISLt Shtedar P a P ^lffi b*p 5 ] (orf) and ^4:^80 and 82* identify the op 

the eraphics data packing mslrncUoni The graphic data „ codes and the second s^ioe^reg^,^ for » gn^h^ uislme- 

compare circuit 64 executes the graphics data compare M ™ °{ tam «l.*" *? "£*t?* 

instructions instruction formats 68fc-68c, bits[13:5J (imm_asi) and bits 

The funciions and constitutions of the paninoned muni- ^M^^ "SS? ^IK, 

•c iro . . • « . r ... for the second graphics inalruchofj formal 680, bils[4:0j 

plicr 58 and the ^pbic* data compare cnrcuU 64 are Similar ( furlher pf0 ^ ^ seC oad source register for a^aph^ 
|o simple trnlUphcr^ ^compare cnrcmts ™^ many a ^ U*u*i™ of the format (or a mask foTa partial condi- 
mtcger execution units known in tbe art, except tbe hardware . , . N r 

are replicated multiple times to allow multiple multiplications and
comparison operations to be performed simultaneously on different
partitioned portions of the operands. Additionally, multiple
mult[illegible] partitioned multiplier for rounding, and comparison
masks are generated by the comparison circuit 64. Thus, the
partitioned multiplier 58, and the graphics data compare circuit 64
will not be further described.

The pixel distance computation, graphics data partitioned
multiplication, graphics data pack/expand/merge, graphics data
logical operation, and graphics data compare instructions, the pixel
distance circuit 56, and the graphics data pack circuit 59 will be
described in further detail below.

While the present invention is being described with an embodiment of
the GRU 28 having two independent partitioned execution paths, and a
particular allocation of graphics instruction execution
responsibilities among the execution paths, based on the descriptions
to follow, it will be appreciated that the present invention may be
practiced with one or more independent partitioned execution paths,
and the graphics instruction execution responsibilities allocated in
any number of manners.

Referring now to FIGS. 6a-6c, three diagrams illustrating the
graphics data formats, the graphics instruction formats, and the
graphics instructions are shown. As illustrated in FIG. 6a, the
exemplary CPU 24 supports three graphics data formats, an eight bit
format (Pixel) 66a, a 16 bit format (Fixed16) 66b, and a 32 bit
format (Fixed32) 66c. Thus, four pixel formatted graphics data are
stored in a 32-bit word, 66a whereas either four Fixed16 or two
Fixed32 formatted graphics data are stored in a 64-bit word 66b or
66c. Image components are stored in either the Pixel or the Fixed16
format 66a or 66b. Intermediate results are stored in either the
Fixed16 or the Fixed32 format 66b or 66c. Typically, the intensity
values of a pixel of an image, e.g. the alpha, green, blue, and red
values (α, G, B, R), are stored in the Pixel format 66a. These
intensity values may be stored in band interleaved where the various
color components of a point in the image are stored together, or band
sequential where all of the values for one component are stored
together. The Fixed16 and Fixed32 formats 66b-66c provide enough

As illustrated in FIG. 6c, the CPU 24 supports a number of GSR
related instructions 200, a number of partitioned
add/subtract/multiplication instructions 202 and 208, a number of
graphics data alignment instructions 204, a number of pixel distance
computation instructions 206, a number of graphics data
pack/expand/merge instructions 210 and 212, a number of graphics data
logical and compare instructions 214 and 216, a number of edge
handling and 3-D array instructions 218 and 220, and a number of
memory access instructions 222.

The GSR related instructions 200 include a RDASR and a WRASR
instruction for reading and writing the alignaddr_offset and the
scale_factor from and into the GSR 50. The RDASR and WRASR
instructions are executed by the IEU 30. The RDASR and WRASR
instructions are similar to other control register read/write
instructions, thus will not be further described.

The graphics data partitioned add/subtract instructions 202 include
four partitioned graphics data addition instructions and four
partitioned graphics data subtraction instructions for simultaneously
adding and subtracting four 16-bit, two 16-bit, two 32-bit, and one
32-bit graphics data respectively. These instructions add or subtract
the corresponding fixed point values in the rs1 and rs2 registers,
and correspondingly place the results in the rd register. As
described earlier, the graphics data partitioned add/subtract
instructions 202 are executed by the partitioned carry adder 37 in
the first independent execution path 32 of the GRU 28.

The graphics data partitioned multiplication instructions 208 include
seven partitioned graphics data multiplication instructions for
simultaneously multiplying either two or four 8-bit graphics data
with another two or four corresponding 16-bit graphics data. A
FMUL8x16 instruction multiplies four 8-bit graphics data in the rs1
register by four corresponding 16-bit graphics data in the rs2
register. For each product, the upper 16 bits are stored in the
corresponding positions of the rd register. A FMUL8x16AU and a
FMUL8x16AL instruction multiplies the four 8-bit graphics data in the
rs1 register by the upper and the lower halves of the 32-bit graphics
data in the rs2 register respectively. Similarly, for each product,
the upper 16 bits are stored in the corresponding positions of the rd
register.

A FMUL8SUx16 instruction multiplies the four upper 8-bits of the four
16-bit graphics data in the rs1 register by the four corresponding
16-bit graphics data in the rs2 register. Likewise, for each product,
the upper 16 bits are stored in the corresponding positions of the rd
register. A FMUL8ULx16 instruction multiplies the four lower 8-bits
of the four 16-bit graphics data in the rs1 register by the four
corresponding 16-bit graphics data in the rs2 register. For each
product, the sign extended upper 8 bits are stored in the
corresponding positions of the rd register.

A FMULD8SUx16 instruction multiplies the two upper 8-bits of the two
16-bit graphics data in the rs1 register by the two corresponding
16-bit graphics data in the rs2 register. For each product, the 24
bits are appended with 8-bit of zeroes and stored in the
corresponding positions of the rd register. A FMULD8ULx16 instruction
multiplies the two lower 8-bits of the two 16-bit graphics data in
the rs1 register by the two corresponding 16-bit graphics data in the
rs2 register. For each product, the 24 bits are sign extended and
stored in the corresponding positions of the rd register.

As described earlier, the graphics data partitioned multiplication
instructions 208 are executed by the partitioned multiplier 58 in the
second independent execution path 34 of the GRU 28.

The graphics data expand and merge instructions 210 include a
graphics data expansion instruction and a graphics data merge
instruction, for simultaneously expanding four 8-bit graphics data
into four 16-bit graphics data, and interleavingly merging eight
8-bit graphics data into four 16-bit graphics data respectively. A
FEXPAND instruction takes four 8-bit graphics data in the rs2
register, left shifts each 8-bit graphics data by 4 bits, and then
zero-extends each left shifted graphics data to 16-bits. The results
are correspondingly placed in the rd register. A FPMERGE instruction
interleavingly merges four 8-bit graphics data from the rs1 register
and four 8-bit graphics data from the rs2 register, into a 64 bit
graphics datum in the rd register. As described earlier, the graphics
data expand and merge instructions 210 are executed by the
expand/merge portions of the graphics data expand/merge circuit 60 in
the first independent execution path 32 of the GRU 28.

The graphics data logical operation instructions 214 include
thirty-two logical operation instructions for performing logical
operations on graphics data. Four logical operations are provided for
zeroes filling or ones filling the rd register in either single or
double precision. Four logical operation instructions are provided
for copying the content of either the rs1 or rs2 register into the rd
register in either single or double precision. Four logical operation
instructions are provided for negating the content of either the rs1
or rs2 register and storing the result into the rd register in either
single or double precision. Some logical operations are provided to
perform a number of Boolean operations against the content of the rs1
and rs2 registers in either single or double precision, and storing
the Boolean results into the rd register. Some of these Boolean
operations are performed after having either the content of the rs1
or the rs2 register negated first. As described earlier, these
graphics data logical operation instructions 214 are executed by the
graphics data logical operation circuit 62 in the first independent
execution path 32 of the GRU 28.

The graphics data compare instructions 216 include eight graphics
data compare instructions for simultaneously comparing four pairs of
16-bit graphics data or two pairs of 32-bit graphics data. The
comparisons between the graphics data in the rs1 and rs2 registers
include greater than, less than, not equal, and equal. Four or two
result bits are stored in the least significant bits in the rd
register. Each result bit is set if the corresponding comparison is
true. Complementary comparisons between the graphics data, i.e., less
than or equal to, and greater than or equal to, are performed by
swapping the graphics data in the rs1 and rs2 registers. As described
earlier, these graphics data compare instructions 216 are executed by
the graphics data compare circuit 62 in the first independent
execution path 32 of the GRU 28.

The graphics data memory reference instructions 222 include a partial
(conditional) store, a short load, a short store, a block load and a
block store instruction. The graphics data load and store
instructions are qualified by the imm_asi and asi values to determine
whether the graphics data load and store instructions 144 and 146 are
to be performed simultaneously on 8-bit graphics data, 16-bit
graphics data, and whether the operations are directed towards the
primary or secondary address spaces in big or little endian format.
For the store operations, the imm_asi values further serve to
determine whether the graphics data store operations are conditional.

A partial (conditional) store operation stores the appropriate number
of values from the rd register to the addresses specified by the rs1
register using the mask specified (in the rs2 location). Mask has the
same format as the results generated by the pixel compare
instructions. The most significant bit of the mask corresponds to the
most significant part of the rs1 register. A short 8-bit load
operation may be performed against arbitrary byte addresses. For a
short 16-bit load operation, the least significant bit of the address
must be zero. Short loads are zero extended to fill the entire
floating point destination register. Short stores access either the
low order 8 or 16 bits of the floating point source register. A block
load/store operation transfers data between 8 contiguous 64-bit
floating point registers and an aligned 64-byte block in memory.

As described earlier, these graphics data memory reference
instructions 222 are executed by the LSU 48 of the CPU 24.

The graphics data alignment instructions 204, the pixel distance
computation instructions 206, the graphics data pack instructions
212, the edge handling instructions 218, and the 3-D array accessing
instructions 220 will be described in further detail below in
conjunction with the pixel distance computation circuit 56 and the
graphics data pack circuit 59 in the second independent execution
path 34 of the GRU 28.

Referring now to FIGS. 7a-7c, the graphics data alignment
instructions, and the relevant portions of one embodiment of the
graphics data alignment circuit are illustrated. As shown in FIG. 7a,
there are two graphics data address calculation instructions 98a-98b,
and one graphics data alignment instruction 100 for calculating
addresses of misaligned graphics data, and aligning misaligned
graphics data. The ALIGNADDR instruction 98a adds the content of the
rs1 and rs2 registers, and stores the result, except the least
significant 3 bits are forced to zeroes, in the rd register. The
least significant 3 bits of the result are stored in the
alignaddr_offset field of GSR 50. The ALIGNADDRL instruction 98b is
the same as the alignaddr instruction 98a, except twos complement of
the least significant 3 bits of the result is stored in the
alignaddr_offset field of GSR 50.

The FALIGNDATA instruction 100 concatenates two 64-bit floating point
values in the rs1 and rs2 registers to form a 16-byte value. The
floating point value in the rs1 register is used as the upper half of
the concatenated value, whereas the floating point value in the rs2
register is used as the lower half of the concatenated value. Bytes
in the concatenated value are numbered from the most significant byte
to the least significant byte, with the most significant byte being
byte 0. Eight bytes are extracted from the concatenated value, where
the most significant byte of the extracted value is the byte whose
number is specified by the alignaddr_offset field of GSR 50. The
result is stored as a 64 bit floating point value in the rd register.
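
The byte extraction performed by FALIGNDATA, as described above, can be sketched in a few lines of Python. The sketch treats the two 64-bit register values as plain integers. The function name `faligndata` and its argument names are chosen here for illustration only.

```python
MASK64 = (1 << 64) - 1


def faligndata(rs1, rs2, alignaddr_offset):
    """Extract 8 bytes from the 16-byte concatenation rs1:rs2.

    Byte 0 is the most significant byte of rs1; the extracted value starts
    at the byte numbered alignaddr_offset (0..7).
    """
    concatenated = ((rs1 & MASK64) << 64) | (rs2 & MASK64)
    offset = alignaddr_offset & 0x7
    # Bytes offset..offset+7 end (8 - offset) bytes above the bottom.
    return (concatenated >> ((8 - offset) * 8)) & MASK64


if __name__ == "__main__":
    upper = 0x0001020304050607
    lower = 0x08090A0B0C0D0E0F
    for off in range(8):
        print(off, format(faligndata(upper, lower, off), "016x"))
```

With an offset of 0 the result is simply the rs1 value; each increment slides the 8-byte window one byte toward rs2.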

Thus, as illustrated in FIG. 7b, by using the ALIGNADDRESS{_LITTLE}
instruction to generate and store the alignaddr_offset in the GSR 50
(step a), copying the two portions of a misaligned graphics data
block 99a-99b from memory into the rs1 and rs2 registers, aligning
and storing the aligned graphics data block into the rd register
using the FALIGNDATA instruction, and then copying the aligned
graphics data block 101 from the rd register into a new memory
location, a misaligned graphics data block 99a-99b can be aligned in
a quick and efficient manner.

As shown in FIG. 7c, in this embodiment, the graphics
data alignment circuit 39 comprises a 64-bit multiplexor
51, coupled to each other and the floating point register file
as shown. The multiplexor 51 aligns misaligned graphics
data as described above.

Referring now to FIGS. 8a-8g, the graphics data packing
instructions, and the relevant portions of the packing portion
of the graphics data pack/expand/merge circuit are illus-
trated. As illustrated in FIGS. 8a-8d, there are three graphics
data packing instructions 106a-106c, for simultaneously
packing four 16-bit graphics data into four 8-bit graphics
data, two 32-bit graphics data into two 8-bit graphics data,
and two 32-bit graphics data into two 16-bit graphics data.

The FPACK16 instruction 106a takes four 16-bit fixed
values in the rs2 register, left shifts them in accordance to the
scale_factor in GSR 50 and maintaining the clipping
information, then extracts and clips 8-bit values starting at
the corresponding immediate bits left of the implicit binary
positions (between bit 7 and bit 6 of each 16-bit value). If the
extracted value is negative (i.e., msb is set), zero is delivered
as the clipped value. If the extracted value is greater than
255, 255 is delivered. Otherwise, the extracted value is the
final result. The clipped values are correspondingly placed in
the rd register.
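
The shift, extract, and clip steps of FPACK16 can be modeled as follows. This is a minimal sketch, not the patented circuit: the function names are hypothetical, the scale factor is passed as an argument instead of being read from the GSR, and Python's unbounded integers stand in for "maintaining the clipping information" (no bits are lost on the left shift).

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def fpack16_lane(value16: int, scale_factor: int) -> int:
    """Pack one 16-bit fixed value into an unsigned 8-bit value."""
    shifted = to_signed(value16, 16) << scale_factor  # overflow bits are kept
    extracted = shifted >> 7  # bits immediately left of the binary point
    return max(0, min(255, extracted))  # negative -> 0, above 255 -> 255


def fpack16(rs2: int, scale_factor: int) -> int:
    """Pack the four 16-bit lanes of rs2 into four bytes of a 32-bit result."""
    result = 0
    for lane in range(4):
        value16 = (rs2 >> (16 * lane)) & 0xFFFF
        result |= fpack16_lane(value16, scale_factor) << (8 * lane)
    return result
```

For example, with a scale factor of 0 the lane value 0x0080 packs to 1, 0x7FFF saturates to 255, and 0x8000 (negative) clips to 0.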

The FPACK32 instruction 106b takes two 32-bit fixed
values in the rs2 register, left shifts them in accordance to
the scale_factor in GSR 50 and maintaining the clipping
information, then extracts and clips 8-bit values starting at
the immediate bits left of the implicit binary positions (i.e.,
between bit 23 and bit 22 of a 32-bit value). For each
extracted value, clipping is performed in the same manner as
described earlier. Additionally, the FPACK32 instruction
106b left shifts each 32-bit value in the rs1 register by 8 bits.
Finally, the FPACK32 instruction 106b correspondingly
merges the clipped values from the rs2 register with the
shifted values from the rs1 register, with the clipped values
occupying the least significant byte positions. The resulting
values are correspondingly placed in the rd register.
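
The merge performed by FPACK32 can be sketched as below. The names are hypothetical, registers are modeled as 64-bit integers holding two 32-bit lanes, and the scale factor is passed in directly; this is an illustration of the description above rather than a definitive implementation.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def fpack32(rs1: int, rs2: int, scale_factor: int) -> int:
    """Clip each 32-bit lane of rs2 to 8 bits and merge it under rs1 << 8."""
    result = 0
    for lane in range(2):
        source = to_signed((rs2 >> (32 * lane)) & 0xFFFFFFFF, 32)
        # Extract the 8 bits left of the binary point (between bits 23 and 22),
        # then clip to the unsigned 8-bit range.
        clipped = max(0, min(255, (source << scale_factor) >> 23))
        # Shift the matching rs1 lane left by 8 bits, staying within 32 bits.
        shifted = ((rs1 >> (32 * lane)) << 8) & 0xFFFFFF00
        result |= (shifted | clipped) << (32 * lane)
    return result
```

Applying the instruction repeatedly with rd fed back as rs1 would accumulate one packed byte per call in each 32-bit lane, which appears to be the purpose of the 8-bit shift.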

The FPACKFIX instruction 106c takes two 32-bit fixed
values in the rs2 register, left shifts each 32-bit value in
accordance to the scale_factor in GSR 50 maintaining the
clipping information, then extracts and clips 16-bit values
starting at the immediate bits left of the implicit binary
positions (i.e., between bit 16 and bit 15 of a 32-bit value).
If the extracted value is less than -32768, -32768 is
delivered as the clipped value. If the extracted value is
greater than 32767, 32767 is delivered. Otherwise, the
extracted value is the final result. The clipped values are
correspondingly placed in the rd register.
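
A comparable sketch for FPACKFIX, which saturates to the signed 16-bit range instead of the unsigned 8-bit range. As before, the function names are hypothetical and the scale factor is supplied as an argument rather than read from the GSR.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def fpackfix(rs2: int, scale_factor: int) -> int:
    """Pack the two 32-bit lanes of rs2 into two signed 16-bit values."""
    result = 0
    for lane in range(2):
        source = to_signed((rs2 >> (32 * lane)) & 0xFFFFFFFF, 32)
        # Extract the 16 bits left of the binary point (between bits 16 and 15).
        extracted = (source << scale_factor) >> 16
        clipped = max(-32768, min(32767, extracted))
        # Store the clipped value as a 16-bit two's complement field.
        result |= (clipped & 0xFFFF) << (16 * lane)
    return result
```

With a scale factor of 1, the lane value 0x7FFFFFFF saturates to 32767 (0x7FFF) and 0x80000000 saturates to -32768 (0x8000).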

As illustrated in FIGS. 8e-8g, in this embodiment, the
graphics data packing circuit 59 comprises circuitry 248,
258 and 268 for executing the FPACK16, FPACK32, and
FPACKFIX instructions respectively.

The circuitry 248 for executing the FPACK16 instruction
comprises four identical portions 240a-240d, one for each
of the four corresponding 16-bit fixed values in the rs2
register. Each portion 240a, ... or 240d comprises a shifter
242a, ... or 242d, an OR gate 244a, ... or 244d, and a
multiplexor 246a, ... or 246d, coupled to each other as
shown. The shifter 242a, ... or 242d shifts the correspond-
ing 16-bit fixed value (excluding the sign bit) according to
the scale factor stored in the GSR 50. The sign bit and the
logical OR of bits [29:15] of each of the shift results are used
to control the corresponding multiplexor 246a, ... or 246d.
Either bits [14:7] of the shift result, the value 0xFF or the
value 0x00 are output.

The circuitry 258 for executing the FPACK32 instruction
comprises two identical portions 250a-250b, one for each of
the two corresponding 32-bit fixed values in the rs2 register.
Each portion 250a or 250b also comprises a shifter 252a or
252b, an OR gate 254a or 254b, and a multiplexor 256a or
256b, coupled to each other as shown. The shifter 252a or
252b shifts the corresponding 32-bit fixed value (excluding
the sign bit) according to the scale factor stored in the GSR
50. The sign bit and the logical OR of bits [45:31] of each
of the shift results are used to control the corresponding
multiplexor 256a or 256b. Either bits [30:23] of the shift
result, the value 0xFF or the value 0x00 are output. The
output is further combined with either bits [55:32] or bits
[23:0] of the rs1 register.

The circuitry 268 for executing the FPACKFIX instruc-
tion also comprises two identical portions 260a-260b, one
for each of the two corresponding 32-bit fixed values in the
rs2 register. Each portion 260a or 260b also comprises a
shifter 262a or 262b, a NAND gate 263a or 263b, a NOR
gate 264a or 264b, two AND gates 265a-265b or
265c-265d, and a multiplexor 266a or 266b, coupled to each
other as shown. The shifter 262a or 262b shifts the corre-
sponding 32-bit fixed value (excluding the sign bit) accord-
ing to the scale factor stored in the GSR 50. The logical
AND of the sign bit and the logical NAND of bits [45:32]
of each of the shift results, and the logical AND of the
inverted sign bit and the logical NOR of bits [45:32] of each
of the shift results, are used to control the corresponding
multiplexor 266a or 266b. Either bits [31:16] of the shift
result, the value 0x7FFF or the value 0x8000 are output.

Referring now to FIGS. 9a-9b, the pixel distance com-
putation instructions, and the pixel distance computation
circuit are illustrated. As shown in FIG. 9a, there is one
graphics data distance computation instruction 138 for
simultaneously accumulating the absolute differences
between graphics data, eight pairs at a time. The PDIST
instruction 138 subtracts eight 8-bit graphics data in the rs1
register from eight corresponding 8-bit graphics data in the
rs2 register. The sum of the absolute values of the differences
is added to the content of the rd register. The PDIST
instruction is typically used for motion estimation in video
compression algorithms.
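
The accumulation performed by PDIST amounts to a sum of absolute differences over eight byte lanes. The sketch below assumes registers are modeled as 64-bit integers and that the accumulator wraps at 64 bits; the function name pdist is hypothetical.

```python
MASK64 = (1 << 64) - 1


def pdist(rs1: int, rs2: int, rd: int) -> int:
    """Add the sum of absolute byte differences of rs1 and rs2 to rd."""
    total = 0
    for lane in range(8):
        a = (rs1 >> (8 * lane)) & 0xFF
        b = (rs2 >> (8 * lane)) & 0xFF
        total += abs(b - a)  # each 8-bit pair contributes at most 255
    return (rd + total) & MASK64
```

The largest sum a single call can contribute is 8 x 255 = 2040, which fits in 11 bits. In a motion estimation loop, rd would be carried from one call to the next so that the distance for a whole block of pixels accumulates in one register.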

As shown in FIG. 9b, in this embodiment, the pixel
distance computation circuit 36 comprises eight pairs of 8
bit subtractors 57a-57h. Additionally, the pixel distance
computation circuit 36 further comprises three 4:2 carry
save adders 61a-61c, a 3:2 carry save adder 62, two registers
63a-63b, and a 11-bit carry propagate adder 65, coupled to
each other as shown. The eight pairs of 8 bit subtractors
57a-57h, the three 4:2 carry save adders 61a-61c, the 3:2
carry save adder 62, the two registers 63a-63b, and the
11-bit carry propagate adder 65, cooperate to compute the
absolute differences between eight pairs of 8-bit values, and
aggregate the absolute differences into a 64-bit sum.

Referring now to FIGS. 10a-10b, the graphics data edge
handling instructions are illustrated. As illustrated, there are
six graphics edge handling instructions 140a-140f, for
simultaneously generating eight 8-bit edge masks, four
16-bit edge masks, and two 32-bit edge masks in big or little
endian format.

The masks are generated in accordance to the graphics
data addresses in the rs1 and rs2 registers, where the
addresses of the next series of pixels to render and the
addresses of the last pixels of the scan line are stored
respectively. The generated masks are stored in the least
significant bits of the rd register.

Each mask is computed from the left and right edge masks 
as follows: 

a) The left edge mask is computed from the 3 least
significant bits (LSBs) of the rs1 register, and the right
edge mask is computed from the 3 (LSBs) of the rs2
register in accordance to FIG. 10b.

b) If 32-bit address masking is disabled, i.e. 64-bit
addressing, and the upper 61 bits of the rs1 register are
equal to the corresponding bits of the rs2 register, then
rd is set equal to the right edge mask ANDed with the
left edge mask.

c) If 32-bit address masking is enabled, i.e. 32-bit
addressing, and the upper 29 bits ([31:3]) of the rs1
register are equal to the corresponding bits of the rs2
register, then the rd register is set to the right edge mask
ANDed with the left edge mask.

d) Otherwise, rd is set to the left edge mask. 
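
Rules a) through d) can be sketched for the 8-bit, big endian case as follows. FIG. 10b is not reproduced in this text, so the two mask tables used here (a left mask that clears one leading bit per byte of offset, and a right mask that sets one more leading bit per byte of offset) are an assumption, as are the function and parameter names.

```python
def edge8(rs1: int, rs2: int, address_mask_32: bool = False) -> int:
    """Compute an 8-bit edge mask from two byte addresses.

    ASSUMPTION: the left/right mask tables below stand in for FIG. 10b.
    """
    # a) Left mask from the 3 LSBs of rs1, right mask from the 3 LSBs of rs2.
    left_mask = 0xFF >> (rs1 & 7)
    right_mask = (0xFF << (7 - (rs2 & 7))) & 0xFF
    # b), c) Compare the upper address bits: 61 bits for 64-bit addressing,
    # bits [31:3] for 32-bit addressing.
    width = 29 if address_mask_32 else 61
    upper_bits = (1 << width) - 1
    same_block = ((rs1 >> 3) & upper_bits) == ((rs2 >> 3) & upper_bits)
    # d) Otherwise only the left edge mask applies.
    return (left_mask & right_mask) if same_block else left_mask
```

When both addresses fall in the same aligned 8-byte block, the result selects only the bytes between the two addresses; otherwise it selects every byte from the current address to the end of the block.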
Additionally, a number of condition codes are modified
as follows:

a) a 32-bit overflow condition code is set if bit 31 (the
sign) of rs1 and rs2 registers differ and bit 31 (the sign)
of the difference differs from bit 31 (the sign) of rs1; a
64-bit overflow condition code is set if bit 63 (the sign)
of rs1 and rs2 registers differ and bit 63 (the sign) of the
difference differs from bit 63 (the sign) of rs1.

b) a 32-bit negative condition code is set if bit 31 (the
sign) of the difference is set; a 64-bit negative condition
code is set if bit 63 (the sign) of the difference is set.

c) a 32-bit zero condition code is set if the 32-bit differ-
ence is zero; a 64-bit zero condition code is set if the
64-bit difference is zero.
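
The three condition code rules can be expressed compactly. The sketch below assumes that "the difference" means rs1 minus rs2 computed modulo 2^64, which the text does not state explicitly; the function name and the dictionary keys cc32 and cc64 are hypothetical labels.

```python
MASK64 = (1 << 64) - 1


def edge_condition_codes(rs1: int, rs2: int) -> dict:
    """Return overflow/negative/zero flags for 32-bit and 64-bit views.

    ASSUMPTION: the difference is rs1 - rs2, truncated to 64 bits.
    """
    difference = (rs1 - rs2) & MASK64

    def flags(bits: int) -> dict:
        sign = 1 << (bits - 1)
        width_mask = (1 << bits) - 1
        a, b, d = rs1 & sign, rs2 & sign, difference & sign
        return {
            # a) operand signs differ and the result sign differs from rs1
            "overflow": a != b and d != a,
            # b) sign bit of the difference is set
            "negative": d != 0,
            # c) the difference is zero at this width
            "zero": (difference & width_mask) == 0,
        }

    return {"cc32": flags(32), "cc64": flags(64)}
```

For instance, rs1 = 0x7FFFFFFF and rs2 = 0xFFFFFFFF overflow in the 32-bit view (a positive value minus a negative one yields a negative result) but not in the 64-bit view, where both operands are positive.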

As described earlier, the graphics edge handling instruc-
tions 140a-140f are executed by the IEU 30. No additional
hardware is required by IEU 30.

Referring now to FIGS. 11a-11b, the 3-D array address-
ing instructions and circuitry are illustrated. As illustrated in
FIG. 11a, there are three 3-D array addressing instructions
142a-142c for converting 8-bit, 16-bit, and 32-bit 3-D
addresses to blocked byte addresses.

Each of these instructions 142a-142c converts 3-D fixed
point addresses in the rs1 register to a blocked byte address,
and stores the resulting blocked byte address in the rd
register. These instructions 142a-142c are typically used for
address interpolation for planar reformatting operations.
Blocking is performed at the 64-byte level to maximize
external cache block reuse, and at the 64k-byte level to
maximize the data cache's translation lookaside buffer
(TLB) entry reuse, regardless of the orientation of the
address interpolation. The element size, i.e., 8-bits, 16-bits,
or 32-bits, is implied by the instruction. The value of the rs2
register specifies the power of two sizes of the X and Y
dimension of a 3D image array. In the embodiment
illustrated, the legal values are from zero to five. A value of
zero specifies 64 elements, a value of one specifies 128
elements, and so on up to 2048 elements for the external
cache block size specified through the value of five. The
integer parts of X, Y, and Z (rs1) are converted to either the
8-bit, the 16-bit, or the 32-bit format. The bits above Z upper
are set to zero. The number of zeros in the least significant
bits is determined by the element size. An element size of
eight bits has no zero, an element size of 16-bits has one
zero, and an element size of 32-bits has two zeroes. Bits in
X and Y above the size specified by the rs2 register are
ignored.
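
The two size relationships stated above (rs2 value to dimension size, and element size to trailing zero bits) reduce to simple arithmetic. The helper names below are hypothetical, and the sketch covers only this arithmetic, not the full address conversion.

```python
def dimension_elements(rs2_value: int) -> int:
    """Map the rs2 value (0..5) to the X/Y dimension size in elements."""
    if not 0 <= rs2_value <= 5:
        raise ValueError("legal rs2 values are 0 through 5")
    return 64 << rs2_value  # 64, 128, 256, 512, 1024, 2048


def trailing_zero_bits(element_bits: int) -> int:
    """Number of zero bits appended below the address for an element size."""
    zeros = {8: 0, 16: 1, 32: 2}
    if element_bits not in zeros:
        raise ValueError("element size must be 8, 16, or 32 bits")
    return zeros[element_bits]
```

The trailing zeros are equivalent to multiplying the element index by the element size in bytes (1, 2, or 4).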

As described earlier, the 3-D array addressing instructions
142a-142c are also executed by the IEU 30. FIG. 11b
illustrates one embodiment of the additional circuitry pro-
vided to the IEU 30. The additional circuitry 300 comprises
two shift registers 308 and 310, and a number of multiplex-
ors 304a-304b and 306, coupled to each other as shown. The
appropriate bits from the lower and middle integer portions
of X, Y, and Z (i.e. bits <12:11>, <34:33>, <55>, <16:13>,
<38:35>, and <59:56>) are first stored into shift register A
308. Similarly, the appropriate bits of the upper integer
portion of Z (i.e. <63:60>) are stored into shift register B
310. Then, selected bits of the upper integer portions of Y
and X are shifted into shift register B 310 in order, depend-
ing on the value of rs2. Finally, zero, one, or two zero bits
are shifted into shift register A 308, with the shift out bits
shifted into shift register B 310, depending on the array
element size (i.e. 8, 16, or 32 bits).

While the present invention has been described in terms
of presently preferred and alternate embodiments, those
skilled in the art will recognize that the invention is not
limited to the embodiments described. The method and
apparatus of the present invention can be practiced with
modification and alteration within the spirit and scope of the
appended claims. The description is thus to be regarded as
illustrative of, and not limiting the scope of the present
invention.
What is claimed is: 
1. A microprocessor comprising: 
an instruction fetch and dispatch unit; 
at least two pipelined execution units connected in par- 
allel to said instruction fetch and dispatch unit, includ- 
ing 

integer execution logic, 
floating point execution logic, and 
graphics execution logic; 
a first register file coupled solely to said integer execution
logic and storing integer operands and results of opera-
tions performed in said integer execution logic; and
a second register file coupled solely to said floating point
execution logic and said graphics execution logic, said
second register file storing floating point and graphics
operands and results of operations performed in said
floating point and graphics execution logic;
wherein said graphics execution logic comprises first and
second graphics execution units, each separately
coupled to said instruction fetch and dispatch unit;
wherein said first graphics execution unit includes an
ALU, and said second graphics execution unit includes
a multiplier;

wherein said second graphics execution unit further
includes a pixel distance computation circuit config-
ured to calculate and accumulate the difference
between multiple pairs of pixels, said pixel distance
computation circuit and said multiplier being config-
ured in parallel such that only one can receive a
decoded instruction from said fetch and dispatch unit in
a given clock cycle.

2. The microprocessor of claim 1 wherein said pixel
distance computation circuit comprises:

a subtractor configured to subtract multiple pixel values in
parallel; and

a plurality of adders for providing a total absolute value 
sum of the subtraction operations on said multiple 
pixels. 

3. A microprocessor comprising: 

an instruction fetch and dispatch unit; 

at least two pipelined execution units connected in par- 
allel to said instruction fetch and dispatch unit, includ- 
ing 

integer execution logic,

floating point execution logic, and
graphics execution logic;

a first register file coupled solely to said integer execution
logic and storing integer operands and results of opera-
tions performed in said integer execution logic; and

a second register file coupled solely to said floating point
execution logic and said graphics execution logic, said
second register file storing floating point and graphics
operands and results of operations performed in said
floating point and graphics execution logic;

wherein said graphics execution logic comprises first and
second graphics execution units, each separately
coupled to said instruction fetch and dispatch unit;

wherein said first graphics execution unit includes an
ALU, and said second graphics execution unit includes
a multiplier;

wherein said second graphics execution unit further
includes a pixel packing circuit configured to pack
N-bit pixels into an M-bit format, where M is less than
N, said pixel packing circuit being in parallel with said
multiplier.

4. A microprocessor comprising: 

an instruction fetch and dispatch unit; 

at least two pipelined execution units connected in par-
allel to said instruction fetch and dispatch unit, includ-
ing

integer execution logic,
floating point execution logic, and
graphics execution logic;
a first register file coupled solely to said integer execution
logic and storing integer operands and results of opera-
tions performed in said integer execution logic; and
a second register file coupled solely to said floating point
execution logic and said graphics execution logic, said
second register file storing floating point and graphics
operands and results of operations performed in said
floating point and graphics execution logic;
wherein said graphics execution logic comprises first and
second graphics execution units, each separately
coupled to said instruction fetch and dispatch unit;
wherein said first graphics execution unit includes an
ALU, and said second graphics execution unit includes
a multiplier;

wherein said first graphics execution unit further includes
a graphics alignment circuit in parallel with said ALU.

5. The microprocessor of claim 4 further comprising a 
graphics status register accessible by said first and second 
graphics execution units, and wherein said graphics align- 
ment circuit comprises a multiplexer having inputs coupled 
to first and second registers in said floating point register 
files, and a control selection input circuit coupled to said 
graphics status register. 

6. A microprocessor comprising: 
an instruction fetch and dispatch unit; 
at least two pipelined execution units connected in par- 
allel to said instruction fetch and dispatch unit, includ- 
ing 

integer execution logic, 
floating point execution logic, and 
graphics execution logic; 
a first register file coupled solely to said integer execution
logic and storing integer operands and results of opera-
tions performed in said integer execution logic; and
a second register file coupled solely to said floating point 
execution logic and said graphics execution logic, said 
second register file storing floating point and graphics 
operands and results of operations performed in said 
floating point and graphics execution logic; 
wherein said microprocessor includes a cache memory, 
and said integer execution unit further comprises a 
dedicated block address conversion circuit, distinct
from other integer operation circuitry, for converting
pixel addresses from a 3D format having X, Y and Z
coordinates linearly set forth in an address to a blocked 
byte format having addresses with a less significant 
portion of said X, Y and Z coordinates followed by a 
more significant portion of said X, Y and Z coordinates. 

7. The microprocessor of claim 6 wherein said blocked 
byte format further comprises a most significant portion of 
said X, Y and Z coordinates following said more significant 
portions, such that said blocked byte address consists of a 
low, middle and high portion of the X, Y and Z coordinates. 

8. The microprocessor of claim 7 wherein said low
portion corresponds to a cache line.

9. The microprocessor of claim 7 wherein all addresses
specified by said middle portion correspond to a single page
of an address for said microprocessor.

X. Related Proceedings Appendix: Copies of Decisions Rendered by a Court or the
Board in any Prior and Pending Appeals, Interferences or Judicial Proceedings

There are no related appeals or interferences to appellant's knowledge that would 
have a bearing on any decision of the Board of Patent Appeals and Interferences. 
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